Re: Race conditions with checkpointer and shutdown

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Race conditions with checkpointer and shutdown
Date: 2019-04-18 22:22:41
Message-ID: 29459.1555626161@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thomas Munro <thomas(dot)munro(at)gmail(dot)com> writes:
> This is a curious thing from dragonet's log:

> 2019-04-16 08:23:24.178 CEST [8335] LOG: received fast shutdown request
> 2019-04-16 08:23:24.178 CEST [8335] LOG: aborting any active transactions
> 2019-04-16 08:23:24.178 CEST [8393] FATAL: terminating walreceiver
> process due to administrator command
> 2019-04-16 08:28:23.166 CEST [8337] LOG: restartpoint starting: time

> LogCheckpointStart() is the thing that writes "starting: ...", and it
> prefers to report "shutdown" over "time", but it didn't.

Yeah, but since we don't see "shutting down", we know that the shutdown
checkpoint hasn't begun. Here's another similar case:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mereswine&dt=2018-11-30%2011%3A44%3A54

The relevant fragment of the standby server's log is

2018-11-30 05:09:22.996 PST [4229] LOG: received fast shutdown request
2018-11-30 05:09:23.628 PST [4229] LOG: aborting any active transactions
2018-11-30 05:09:23.649 PST [4231] LOG: checkpoint complete: wrote 17 buffers (13.3%); 0 WAL file(s) added, 0 removed, 1 recycled; write=3.021 s, sync=0.000 s, total=3.563 s; sync files=0, longest=0.000 s, average=0.000 s; distance=16563 kB, estimate=16563 kB
2018-11-30 05:09:23.679 PST [4229] LOG: background worker "logical replication launcher" (PID 4276) exited with exit code 1
2018-11-30 05:11:23.757 PST [4288] master LOG: unexpected EOF on standby connection
2018-11-30 05:11:23.883 PST [4229] LOG: received immediate shutdown request
2018-11-30 05:11:23.907 PST [4229] LOG: database system is shut down

To the extent that I've found logs in which the checkpointer prints
anything at all during this interval, it seems to be just quietly
plodding along with its usual business, without any hint that it's
aware of the pending shutdown request. It'd be very easy to believe
that the postmaster -> checkpointer SIGUSR2 is simply getting dropped,
or never issued.

Hmm ... actually, looking at the postmaster's logic, it won't issue
SIGUSR2 to the checkpointer until the walreceiver (if any) is gone.
And now that I think about it, several of these logs contain traces
showing that the walreceiver is still live. Like the one quoted above:
seems like the line from PID 4288 has to be from a walreceiver.

Maybe what we should be looking for is "why doesn't the walreceiver
shut down"? But the dragonet log you quote above shows the walreceiver
exiting, or at least starting to exit. Tis a puzzlement.

I'm a bit tempted to add temporary debug logging to the postmaster so
that walreceiver start/stop is recorded at LOG level. We'd have to wait
a few weeks to have any clear result from the buildfarm, but I'm not sure
how we'll get any hard data without some such measures.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2019-04-18 22:39:46 Re: block-level incremental backup
Previous Message Thomas Munro 2019-04-18 21:58:40 Re: Race conditions with checkpointer and shutdown