Re: pgsql: Make new crash restart test a bit more robust.

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-committers(at)postgresql(dot)org
Subject: Re: pgsql: Make new crash restart test a bit more robust.
Date: 2017-09-20 01:06:29
Message-ID: 20170920010629.3icid5ukemkrr42k@alap3.anarazel.de
Lists: pgsql-committers pgsql-hackers

On 2017-09-19 16:46:58 -0400, Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
> > So this is genuinely interesting. When the machine is really loaded (as
> > in 6 animals running on a vm at the same time, including valgrind), psql
> > sometimes doesn't get the WARNING message from a shutdown. Instead it
> > gets
> > # psql:<stdin>:3: server closed the connection unexpectedly
> > # This probably means the server terminated abnormally
> > # before or while processing the request.
> > # psql:<stdin>:3: connection to server was lost
>
> That seems pretty weird. Maybe it's not the same case, but in
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2017-09-19%2020%3A10%3A02
>
> you can see from the postmaster log that the backend *is* issuing
> the message, or at least it's getting to the server log:
>
> 2017-09-19 20:20:34.476 UTC [6363] [unknown] LOG: connection received: host=[local]
> 2017-09-19 20:20:34.477 UTC [6363] [unknown] LOG: connection authorized: user=andres database=postgres
> 2017-09-19 20:20:34.478 UTC [6363] t/013_crash_restart.pl LOG: statement: SELECT $$psql-connected$$;
> ...
> 2017-09-19 20:20:34.485 UTC [6363] t/013_crash_restart.pl WARNING: terminating connection because of crash of another server process
> 2017-09-19 20:20:34.485 UTC [6363] t/013_crash_restart.pl DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> 2017-09-19 20:20:34.485 UTC [6363] t/013_crash_restart.pl HINT: In a moment you should be able to reconnect to the database and repeat your command.
>
> Have we forgotten an fflush() or something?

After hacking up a fix for my previous theory, I added strace to the
mix to verify it. With strace attached the problem takes longer to
reproduce, but after restricting the trace to -e trace=network, I got
this:

socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
socket(AF_UNIX, SOCK_STREAM, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path="/tmp/EDkYotgk3u/.s.PGSQL.57230"}, 110) = 0
getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
getsockname(3, {sa_family=AF_UNIX}, [128->2]) = 0
sendto(3, "\0\0\0O\0\3\0\0user\0andres\0database\0pos"..., 79, MSG_NOSIGNAL, NULL, 0) = 79
recvfrom(3, "R\0\0\0\10\0\0\0\0S\0\0\0,application_name\0t"..., 16384, 0, NULL, NULL) = 340
sendto(3, "Q\0\0\0\37SELECT $$psql-connected$$;\0", 32, MSG_NOSIGNAL, NULL, 0) = 32
recvfrom(3, "T\0\0\0!\0\1?column?\0\0\0\0\0\0\0\0\0\0\31\377\377\377\377\377\377"..., 16384, 0, NULL, NULL) = 79
sendto(3, "Q\0\0\0\33SELECT pg_sleep(3600);\0", 28, MSG_NOSIGNAL, NULL, 0) = 28
recvfrom(3, 0x555817dae2a0, 16384, 0, NULL, NULL) = -1 ECONNRESET (Connection reset by peer)
+++ exited with 2 +++

So indeed, psql got an ECONNRESET before receiving the proper error
message.
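
As an aid to reading the buffers in the strace output above: in version 3
of the frontend/backend protocol, a regular message is a one-byte type
followed by a four-byte big-endian length that counts itself and the
payload, but not the type byte. strace prints that length in octal
escapes, so \37 is 31 and \33 is 27. The following is only a minimal
Python sketch of that framing; the helper names query_message and
split_message are made up for illustration and are not libpq functions.

```python
import struct


def query_message(sql: str) -> bytes:
    """Frame a simple-query ('Q') message: type byte, int32 length, NUL-terminated SQL."""
    body = sql.encode("utf-8") + b"\x00"
    # The length field counts itself (4 bytes) plus the body, but not the type byte.
    return b"Q" + struct.pack("!I", len(body) + 4) + body


def split_message(buf: bytes):
    """Return (type, payload, rest) for the first complete message in buf."""
    msg_type = buf[:1]
    (length,) = struct.unpack("!I", buf[1:5])
    end = 1 + length
    return msg_type, buf[5:end], buf[end:]


connected = query_message("SELECT $$psql-connected$$;")
sleep = query_message("SELECT pg_sleep(3600);")

# These sizes match the sendto() return values in the trace: 32 and 28.
print(len(connected), len(sleep))
```

Applied to the start of the first recvfrom() buffer, split_message
yields type R with a four-byte zero payload, which is the
authentication-ok response. The last recvfrom() fails before any such
framed message is returned at all.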

The corresponding server log (debug3):
2017-09-20 00:57:00.573 UTC [713] DEBUG: sending SIGQUIT to process 730
2017-09-20 00:57:00.573 UTC [713] DEBUG: sending SIGQUIT to process 716
2017-09-20 00:57:00.573 UTC [713] DEBUG: sending SIGQUIT to process 715
2017-09-20 00:57:00.573 UTC [713] DEBUG: sending SIGQUIT to process 717
2017-09-20 00:57:00.573 UTC [713] DEBUG: sending SIGQUIT to process 718
2017-09-20 00:57:00.573 UTC [713] DEBUG: sending SIGQUIT to process 719
2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl WARNING: terminating connection because of crash of another server process
2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl HINT: In a moment you should be able to reconnect to the database and repeat your command.
2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2017-09-20 00:57:00.573 UTC [720] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl DEBUG: proc_exit(-1): 0 callbacks to make
...
2017-09-20 00:57:00.577 UTC [713] DEBUG: server process (PID 730) exited with exit code 2
2017-09-20 00:57:00.577 UTC [713] DETAIL: Failed process was running: SELECT pg_sleep(3600);
2017-09-20 00:57:00.577 UTC [713] LOG: all server processes terminated; reinitializing

So the backend was indeed terminated by SIGQUIT, not by an escalation
to SIGKILL. It wrote the messages to the server log, and didn't
complain about communication with the client... Odd.
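
The reason "exited with exit code 2" points to SIGQUIT rather than
SIGKILL is visible in the wait status: a process that catches SIGQUIT
and exits by itself reports a normal exit code, whereas SIGKILL cannot
be caught and is reported as termination by signal 9. The sketch below
illustrates that distinction on a POSIX system with a stand-in child
process. It is an analogy only, not PostgreSQL code, and the names
CHILD and status_after are invented for the example.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for a backend: on SIGQUIT, exit immediately with status 2.
CHILD = r"""
import os, signal, sys, time
signal.signal(signal.SIGQUIT, lambda signum, frame: os._exit(2))
sys.stdout.write("ready\n")
sys.stdout.flush()
while True:
    time.sleep(60)
"""


def status_after(sig: int) -> int:
    """Start the child, deliver sig once its handler is installed, return its status."""
    proc = subprocess.Popen([sys.executable, "-c", CHILD], stdout=subprocess.PIPE)
    proc.stdout.readline()  # block until the child has printed "ready"
    proc.send_signal(sig)
    rc = proc.wait(timeout=30)
    proc.stdout.close()
    return rc


if __name__ == "__main__":
    # Python reports death by signal N as a return code of -N.
    print("SIGQUIT:", status_after(signal.SIGQUIT))
    print("SIGKILL:", status_after(signal.SIGKILL))
```

With the handler in place the SIGQUIT case returns 2, the same shape as
the postmaster's report for PID 730, while the SIGKILL case returns a
negative value because the child never gets to run any code.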

Greetings,

Andres Freund
