| From: | Tomas Vondra <tomas(at)vondra(dot)me> |
|---|---|
| To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
| Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: occasional ECPG failures on dikkop (FreeBSD) |
| Date: | 2026-06-29 22:46:13 |
| Message-ID: | d92b18e8-00aa-4c32-95ff-7d807a427481@vondra.me |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On 5/6/26 16:15, Tom Lane wrote:
> Tomas Vondra <tomas(at)vondra(dot)me> writes:
>> Ah, I haven't noticed the last couple messagess, mentioning dikkop. Not
>> sure if it's the same issue - I guess we'll have to wait if the failures
>> go away once it gets fixed.
>
>> Still, it seems like a long-standing issue. It's curious dikkop started
>> failing only very recently, after the OS upgrade.
>
> Yeah. Given that these are threading tests, I'm suspecting some
> change in thread scheduling behavior in this latest FBSD release,
> which somehow made it easier to hit a pre-existing issue.
>
The failures on dikkop continued, and I got annoyed by that. So I spent
a bit of time investigating this today. I managed to reproduce the
crashes in a way that allowed me to grab backtraces from the core files
(which we don't seem to get in the buildfarm). And all of them generally
looked like this:
--------------------------------------------------------------------
Program terminated with signal SIGSEGV, Segmentation fault.
Address not mapped to object.
#0 __vfprintf (fp=fp(at)entry=0x889dab00, locale=locale(at)entry=0x841dfcf0
<__xlocale_global_locale>, serrno=0, fmt0=fmt0(at)entry=0x20087a
"Connection: %d", ap=...) at /usr/src/lib/libc/stdio/vfprintf.c:477
warning: 477 /usr/src/lib/libc/stdio/vfprintf.c: No such file or directory
[Current thread is 1 (LWP 933707)]
(gdb) bt
#0 __vfprintf (fp=fp(at)entry=0x889dab00, locale=locale(at)entry=0x841dfcf0
<__xlocale_global_locale>, serrno=0, fmt0=fmt0(at)entry=0x20087a
"Connection: %d", ap=...) at /usr/src/lib/libc/stdio/vfprintf.c:477
#1 0x000000008411d434 in vsprintf_l (locale=0x841dfcf0
<__xlocale_global_locale>, ap=..., str=<optimized out>, fmt=<optimized
out>) at /usr/src/lib/libc/stdio/vsprintf.c:61
#2 vsprintf (str=0x889daefc "", fmt=<optimized out>, ap=...) at
/usr/src/lib/libc/stdio/vsprintf.c:68
#3 0x00000000841123f4 in sprintf (str=0x0, str(at)entry=0x889daefc "",
fmt=0x841dfcf0 <__xlocale_global_locale> "") at
/usr/src/lib/libc/stdio/sprintf.c:56
#4 0x0000000000210ed8 in fn (arg=0x0) at prep.pgc:39
#5 0x00000000847844fc in thread_start (curthread=0x63044546d010) at
/usr/src/lib/libthr/thread/thr_create.c:291
#6 0x0000000000000000 in ?? ()
--------------------------------------------------------------------
Which is a crash in snprintf, or rather vfprintf on this line:
decpt_len = (decimal_point[1] == '\0' ? 1 : strlen(decimal_point));
That's a bit bizarre, because I thought snprintf is meant to be thread
safe. After staring at this for a while and getting nowhere, I decided
to make sure that's actually true, by doing this:
--------------------------------------------------------------------
#include <stdio.h>
#include <pthread.h>
void *fn(void *arg) {
char buf[128];
for(int i=0; i<10000; i++)
sprintf(buf, "%d", 1);
return NULL;
}
int main() {
pthread_t th[100];
for(int i=0; i<100; i++)
pthread_create(&th[i], NULL, fn, NULL);
for(int i=0; i<100; i++)
pthread_join(th[i], NULL);
return 0;
}
--------------------------------------------------------------------
And after running that a in a loop for a couple seconds ...
$ while /usr/bin/true; do ./test; done;
Segmentation fault (core dumped)
The backtrace is exactly the one I posted earlier. Turns out snprintf()
is not as thread-safe as I thought.
AFAIK the culprit it the line right before the crash:
decimal_point = localeconv_l(locale)->decimal_point;
/* The overwhelmingly common case is decpt_len == 1. */
decpt_len = (decimal_point[1] == '\0' ? 1 : strlen(decimal_point));
The localeconv_t() rebuilds some sort of cache, protected by an atomic
flag. But it uses atomic_store_int(), which is a "relaxed" variant:
From "atomic" manpage:
... The first, or relaxed variant, performs the operation without
imposing any ordering constraints on accesses to other memory
locations. ...
So localeconv_t() updates some fields (e.g. "decimal_point" pointer),
and sets the flag. But there's no guarantee in which order will other
threads see these updates. A thread may see the new "flag" value, but
not the decimal_point value. That seems ... not great?
I suppose this only affects architectures like ARM/AArch64, which is
what dikkop runs on (rpi5). Which would explain why we don't see this on
the x86_64 animals. opaleye didn't hit it either, but it has not run for
months, and it didn't dikkop only started hitting it ~3 months ago.
As for a fix, it seems enough to populate the cache before starting the
threads, e.g. by doing a sprintf().
regards
--
Tomas Vondra
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Masahiko Sawada | 2026-06-29 23:01:30 | Re: Report bytes and transactions actually sent downtream |
| Previous Message | Melanie Plageman | 2026-06-29 22:32:47 | Missing FSM Update when Updating VM On-access |