Re: occasional ECPG failures on dikkop (FreeBSD)

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: occasional ECPG failures on dikkop (FreeBSD)
Date: 2026-06-29 22:46:13
Message-ID: d92b18e8-00aa-4c32-95ff-7d807a427481@vondra.me
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 5/6/26 16:15, Tom Lane wrote:
> Tomas Vondra <tomas(at)vondra(dot)me> writes:
>> Ah, I haven't noticed the last couple messagess, mentioning dikkop. Not
>> sure if it's the same issue - I guess we'll have to wait if the failures
>> go away once it gets fixed.
>
>> Still, it seems like a long-standing issue. It's curious dikkop started
>> failing only very recently, after the OS upgrade.
>
> Yeah. Given that these are threading tests, I'm suspecting some
> change in thread scheduling behavior in this latest FBSD release,
> which somehow made it easier to hit a pre-existing issue.
>

The failures on dikkop continued, and I got annoyed by that. So I spent
a bit of time investigating this today. I managed to reproduce the
crashes in a way that allowed me to grab backtraces from the core files
(which we don't seem to get in the buildfarm). And all of them generally
looked like this:

--------------------------------------------------------------------
Program terminated with signal SIGSEGV, Segmentation fault.
Address not mapped to object.
#0 __vfprintf (fp=fp(at)entry=0x889dab00, locale=locale(at)entry=0x841dfcf0
<__xlocale_global_locale>, serrno=0, fmt0=fmt0(at)entry=0x20087a
"Connection: %d", ap=...) at /usr/src/lib/libc/stdio/vfprintf.c:477

warning: 477 /usr/src/lib/libc/stdio/vfprintf.c: No such file or directory
[Current thread is 1 (LWP 933707)]
(gdb) bt
#0 __vfprintf (fp=fp(at)entry=0x889dab00, locale=locale(at)entry=0x841dfcf0
<__xlocale_global_locale>, serrno=0, fmt0=fmt0(at)entry=0x20087a
"Connection: %d", ap=...) at /usr/src/lib/libc/stdio/vfprintf.c:477
#1 0x000000008411d434 in vsprintf_l (locale=0x841dfcf0
<__xlocale_global_locale>, ap=..., str=<optimized out>, fmt=<optimized
out>) at /usr/src/lib/libc/stdio/vsprintf.c:61
#2 vsprintf (str=0x889daefc "", fmt=<optimized out>, ap=...) at
/usr/src/lib/libc/stdio/vsprintf.c:68
#3 0x00000000841123f4 in sprintf (str=0x0, str(at)entry=0x889daefc "",
fmt=0x841dfcf0 <__xlocale_global_locale> "") at
/usr/src/lib/libc/stdio/sprintf.c:56
#4 0x0000000000210ed8 in fn (arg=0x0) at prep.pgc:39
#5 0x00000000847844fc in thread_start (curthread=0x63044546d010) at
/usr/src/lib/libthr/thread/thr_create.c:291
#6 0x0000000000000000 in ?? ()
--------------------------------------------------------------------

Which is a crash in snprintf, or rather vfprintf on this line:

decpt_len = (decimal_point[1] == '\0' ? 1 : strlen(decimal_point));

That's a bit bizarre, because I thought snprintf is meant to be thread
safe. After staring at this for a while and getting nowhere, I decided
to make sure that's actually true, by doing this:

--------------------------------------------------------------------
#include <stdio.h>
#include <pthread.h>

void *fn(void *arg) {
char buf[128];
for(int i=0; i<10000; i++)
sprintf(buf, "%d", 1);
return NULL;
}

int main() {
pthread_t th[100];

for(int i=0; i<100; i++)
pthread_create(&th[i], NULL, fn, NULL);

for(int i=0; i<100; i++)
pthread_join(th[i], NULL);

return 0;
}
--------------------------------------------------------------------

And after running that a in a loop for a couple seconds ...

$ while /usr/bin/true; do ./test; done;
Segmentation fault (core dumped)

The backtrace is exactly the one I posted earlier. Turns out snprintf()
is not as thread-safe as I thought.

AFAIK the culprit it the line right before the crash:

decimal_point = localeconv_l(locale)->decimal_point;
/* The overwhelmingly common case is decpt_len == 1. */
decpt_len = (decimal_point[1] == '\0' ? 1 : strlen(decimal_point));

The localeconv_t() rebuilds some sort of cache, protected by an atomic
flag. But it uses atomic_store_int(), which is a "relaxed" variant:

From "atomic" manpage:

... The first, or relaxed variant, performs the operation without
imposing any ordering constraints on accesses to other memory
locations. ...

So localeconv_t() updates some fields (e.g. "decimal_point" pointer),
and sets the flag. But there's no guarantee in which order will other
threads see these updates. A thread may see the new "flag" value, but
not the decimal_point value. That seems ... not great?

I suppose this only affects architectures like ARM/AArch64, which is
what dikkop runs on (rpi5). Which would explain why we don't see this on
the x86_64 animals. opaleye didn't hit it either, but it has not run for
months, and it didn't dikkop only started hitting it ~3 months ago.

As for a fix, it seems enough to populate the cache before starting the
threads, e.g. by doing a sprintf().

regards

--
Tomas Vondra

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2026-06-29 23:01:30 Re: Report bytes and transactions actually sent downtream
Previous Message Melanie Plageman 2026-06-29 22:32:47 Missing FSM Update when Updating VM On-access