Re: conchuela timeouts since 2021-10-09 system upgrade

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
Cc: Noah Misch <noah(at)leadboat(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Peter Geoghegan <pg(at)bowt(dot)ie>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: conchuela timeouts since 2021-10-09 system upgrade
Date: 2021-11-19 04:09:04
Message-ID: CA+hUKG+jTMdxkyZJbUGyG7ggFKABWsBwSxSzLNVm-X87DeLfqg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

I managed to reproduce the 017_shm.pl hang using my Vagrant file (see
earlier). It happens for "gmake -s check-world", but not "gmake -s -C
src/test/recovery check", which might have confused me last time. The
symptoms are as described already, though this time I learned with
truss that it's in a retry loop waiting 1 second at a time.

I don't know how to get debug symbols for libc on this operating
system (well, probably needed to select debug option while installing
the OS, but I didn't install the OS, I'm using a lazy bones pre-rolled
Vagrant image). So in order to be able to inspect the arguments to
select, I made my own LD_PRELOAD wrapper for select (see end), and I
observed that it was calling select() with an empty fd_set:

(gdb) bt
#0 0x00000008014c1e6c in select () from /lib/libc.so.8
#1 0x00000008009f0fe6 in select () from /usr/lib/libpthread.so.0
#2 0x000000080044c622 in select (nfds=16, readfds=0x8017f6de0,
writefds=0x8017f6d50, exceptfds=0x0,
timeout=0x7fffffdfd4a0) at select.c:13
#3 0x00000008007bfaf9 in Perl_pp_sselect () from
/usr/local/lib/perl5/5.32/mach/CORE/libperl.so.5.32
#4 0x000000080076b036 in Perl_runops_standard ()
from /usr/local/lib/perl5/5.32/mach/CORE/libperl.so.5.32
#5 0x00000008006da3b7 in perl_run () from
/usr/local/lib/perl5/5.32/mach/CORE/libperl.so.5.32
#6 0x0000000000400de4 in main ()
(gdb) f 2
#2 0x000000080044c622 in select (nfds=16, readfds=0x8017f6de0,
writefds=0x8017f6d50, exceptfds=0x0,
timeout=0x7fffffdfd4a0) at select.c:13
13 return real_function(nfds, readfds, writefds, exceptfds, timeout);
(gdb) print nfds
$1 = 16
(gdb) print *readfds
$2 = {fds_bits = {0 <repeats 16 times>}}
(gdb) print *writefds
$3 = {fds_bits = {0 <repeats 16 times>}}

So it looks a lot like something on the perl side has lost track of
the pipe it's supposed to be selecting on. If I understand correctly,
it's supposed to be waiting for one of the following pipes to appear
as readable, whichever is the one that the zombie psql process
previously held the other end of:

$ fstat -p 73032
USER CMD PID FD PATH INUM MODE
SZ|DV R/W
vagrant perl 73032 root / 1 drwxr-xr-x
offset:0 r
vagrant perl 73032 wd
/home/vagrant/postgres/src/test/recovery 4434524058 drwxr-xr-x
offset:0 r
vagrant perl 73032 text /pfs/@@-1:00001/local/bin/perl
4329919842 -rwxr-xr-x offset:0 r
vagrant perl 73032 0 /dev/pts/0
1335 crw--w---- pts/0:155171 rw
vagrant perl 73032 1
/home/vagrant/postgres/src/test/recovery/tmp_check/log/regress_log_017_shm
4477427235 -rw-r--r-- offset:9457 w
vagrant perl 73032 2
/home/vagrant/postgres/src/test/recovery/tmp_check/log/regress_log_017_shm
4477427235 -rw-r--r-- offset:9457 w
vagrant perl 73032 3
/home/vagrant/postgres/src/test/recovery/tmp_check/log/regress_log_017_shm
4477427235 -rw-r--r-- offset:9457 w
vagrant perl 73032 4* pipe fffff800aac1dca0 (B<->A) ravail 0
wavail 0 rw
vagrant perl 73032 5* pipe fffff800ab6e9020 (B<->A) ravail 0
wavail 0 rw
vagrant perl 73032 6* pipe fffff800aac1dca0 (B<->A) ravail 0
wavail 0 rw
vagrant perl 73032 7* pipe fffff800ab6e9020 (B<->A) ravail 0
wavail 0 rw
vagrant perl 73032 8* pipe fffff800aac1dca0 (B<->A) ravail 0
wavail 0 rw
vagrant perl 73032 9* pipe fffff800ab6e9020 (B<->A) ravail 0
wavail 0 rw
vagrant perl 73032 13* pipe fffff800ab157560 (B<->A) ravail 0
wavail 0 rw

=== select wrapper ===

$ cat select.c
#include <dlfcn.h>
#include <sys/select.h>

int
select(int nfds,fd_set *readfds, fd_set *writefds, fd_set *exceptfds,
struct timeval *timeout)
{
static int (*real_function)(int, fd_set *, fd_set *, fd_set *,
struct timeval *);

if (!real_function)
real_function = dlsym(RTLD_NEXT, "select");

return real_function(nfds, readfds, writefds, exceptfds, timeout);
}

$ cc -Wall -fPIC -shared -g -o myselect.so select.c -ldl
$ cd postgres
$ LD_PRELOAD=/home/vagrant/myselect.so gmake check-world -s

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Дмитрий Иванов 2021-11-19 04:45:45 Re: pg_restore depending on user functions
Previous Message Peter Geoghegan 2021-11-19 02:55:08 Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum