Re: stress test for parallel workers

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>
Cc: mark(at)2ndquadrant(dot)com, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: stress test for parallel workers
Date: 2019-10-11 15:45:31
Message-ID: 20032.1570808731@sss.pgh.pa.us
Lists: pgsql-hackers

Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com> writes:
>> At least on F29 I have set /proc/sys/kernel/core_pattern and it works.

FWIW, I'm not excited about that as a permanent solution. It requires
root privilege, and it affects the whole machine, not only the buildfarm,
and making it persist across reboots is even more invasive.

> I have done the same on this machine. wobbegong runs every hour, so
> let's see what happens next. With any luck the buildfarm will give us a
> stack trace without needing further action.

I already collected one manually. It looks like this:

Program terminated with signal SIGSEGV, Segmentation fault.
#0 sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5114
5114 {
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.26-30.fc27.ppc64le
(gdb) bt
#0 sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5114
#1 <signal handler called>
#2 0x00007fff93923ca4 in sigprocmask () from /lib64/libc.so.6
#3 0x00000000103fad08 in reaper (postgres_signal_arg=<optimized out>)
at postmaster.c:3215
#4 <signal handler called>
#5 0x00007fff93923ca4 in sigprocmask () from /lib64/libc.so.6
#6 0x00000000103f9f98 in sigusr1_handler (postgres_signal_arg=<optimized out>)
at postmaster.c:5275
#7 <signal handler called>
#8 0x00007fff93923ca4 in sigprocmask () from /lib64/libc.so.6
#9 0x00000000103fad08 in reaper (postgres_signal_arg=<optimized out>)
at postmaster.c:3215
#10 <signal handler called>
#11 sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5114
#12 <signal handler called>
#13 0x00007fff93923ca4 in sigprocmask () from /lib64/libc.so.6
#14 0x00000000103f9f98 in sigusr1_handler (postgres_signal_arg=<optimized out>)
at postmaster.c:5275
#15 <signal handler called>
#16 0x00007fff93923ca4 in sigprocmask () from /lib64/libc.so.6
#17 0x00000000103fad08 in reaper (postgres_signal_arg=<optimized out>)
at postmaster.c:3215
...
#572 <signal handler called>
#573 0x00007fff93923ca4 in sigprocmask () from /lib64/libc.so.6
#574 0x00000000103f9f98 in sigusr1_handler (
postgres_signal_arg=<optimized out>) at postmaster.c:5275
#575 <signal handler called>
#576 0x00007fff93923ca4 in sigprocmask () from /lib64/libc.so.6
#577 0x00000000103fad08 in reaper (postgres_signal_arg=<optimized out>)
at postmaster.c:3215
#578 <signal handler called>
#579 sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5114
#580 <signal handler called>
#581 0x00007fff93a01514 in select () from /lib64/libc.so.6
#582 0x00000000103f7ad8 in ServerLoop () at postmaster.c:1682
#583 PostmasterMain (argc=<optimized out>, argv=<optimized out>)
at postmaster.c:1391
#584 0x0000000000000000 in ?? ()

What we've apparently got here is that signals were received
so fast that the postmaster ran out of stack space. I remember
Andres complaining about this as a theoretical threat, but I
hadn't seen it in the wild before.
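
For anyone who wants to see the mechanism in isolation, here is a minimal
standalone sketch. It is not the postmaster's code; every name in it is
made up for illustration, and SIGUSR2 stands in for SIGCHLD. It assumes a
single-threaded POSIX process. Each handler runs with both signals blocked,
finds the other signal pending, and then unblocks signals before returning,
so the pending signal is delivered on top of the frame that is still live.
The trace above suggests the same shape, with every handler interrupted
inside sigprocmask(). The sketch caps the nesting with a counter; nothing
caps it in the real case.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Hypothetical names throughout; this only illustrates the nesting pattern. */
static volatile sig_atomic_t depth = 0;      /* current handler nesting level */
static volatile sig_atomic_t max_depth = 0;  /* deepest nesting observed */
static volatile sig_atomic_t limit = 0;      /* stop re-raising at this depth */
static sigset_t both;                        /* SIGUSR1 and SIGUSR2 */

static void
handler(int signo)
{
    /* Both signals are blocked on entry because of sa_mask. */
    depth++;
    if (depth > max_depth)
        max_depth = depth;

    if (depth < limit)
    {
        /* Simulate the "other" signal arriving; it stays pending while blocked. */
        raise(signo == SIGUSR1 ? SIGUSR2 : SIGUSR1);

        /*
         * Unblocking inside the handler delivers the pending signal before
         * sigprocmask() returns, stacking a new handler frame on this one.
         */
        sigprocmask(SIG_UNBLOCK, &both, NULL);
        sigprocmask(SIG_BLOCK, &both, NULL);
    }
    depth--;
}

/* Returns the deepest handler nesting reached for the given cap. */
int
nested_handler_depth(int max_nesting)
{
    struct sigaction sa;
    int         rc;

    sigemptyset(&both);
    sigaddset(&both, SIGUSR1);
    sigaddset(&both, SIGUSR2);

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handler;
    sa.sa_mask = both;          /* block both signals while a handler runs */
    sa.sa_flags = 0;

    rc = sigaction(SIGUSR1, &sa, NULL);
    assert(rc == 0);
    rc = sigaction(SIGUSR2, &sa, NULL);
    assert(rc == 0);

    /* Make sure the signals are deliverable in the main flow. */
    sigprocmask(SIG_UNBLOCK, &both, NULL);

    depth = 0;
    max_depth = 0;
    limit = max_nesting;

    raise(SIGUSR1);             /* handler chain runs before raise() returns */
    return max_depth;
}
```

Calling nested_handler_depth(50) from a main() should report a depth of 50:
fifty handler frames stacked up from a single initial signal. Remove the
cap and the only limit left is the stack size.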

I haven't finished investigating though, as there are some things
that remain to be explained. The dependency on having
force_parallel_mode = regress makes sense now, because the extra
traffic to launch and reap all those parallel workers would
increase the stress on the postmaster (and it seems likely that
this stack trace corresponds exactly to alternating launch and
reap signals). But why does it only happen during the pg_upgrade
test --- plain "make check" ought to be about the same? I also
want to investigate why clang builds seem more prone to this
than gcc builds on the same machine; that might just be down to
more or less stack consumption, but it bears looking into.

regards, tom lane
