Re: Another reason why the recovery tests take a long time

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Another reason why the recovery tests take a long time
Date: 2017-06-26 16:46:59
Message-ID: 20170626164659.ea43donv7tqmwl5a@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2017-06-26 12:32:00 -0400, Tom Lane wrote:
> I've found another edge-case bug through investigation of unexpectedly
> slow recovery test runs. It goes like this:
>
> * While streaming from master to slave, test script shuts down master
> while slave is left running. We soon restart the master, but meanwhile:
>
> * slave's walreceiver process fails, reporting
>
> 2017-06-26 16:06:50.209 UTC [13209] LOG: replication terminated by primary server
> 2017-06-26 16:06:50.209 UTC [13209] DETAIL: End of WAL reached on timeline 1 at 0/3000098.
> 2017-06-26 16:06:50.209 UTC [13209] FATAL: could not send end-of-streaming message to primary: no COPY in progress
>
> * slave's startup process observes that walreceiver is gone and sends
> PMSIGNAL_START_WALRECEIVER to ask for a new one
>
> * more often than you would guess, in fact nearly 100% reproducibly for
> me, the postmaster receives/services the PMSIGNAL before it receives
> SIGCHLD for the walreceiver. In this situation sigusr1_handler just
> throws away the walreceiver start request, reasoning that the walreceiver
> is already running.

Yuck.

I've recently seen a bunch of symptoms vaguely around this, e.g. I can
atm frequently reconnect to a new backend after a
PANIC/segfault/whatnot, before postmastre gets the message. I've not
yet figured out whether that's a kernel change, or whether some of the
more recent tinkering in postmaster.c changed this.

> * eventually, it dawns on the startup process that the walreceiver
> isn't starting, and it asks for a new one. But that takes ten seconds
> (WALRCV_STARTUP_TIMEOUT).
>
> So this looks like a pretty obvious race condition in the postmaster,
> which should be resolved by having it set a flag on receipt of
> PMSIGNAL_START_WALRECEIVER that's cleared only when it does start a
> new walreceiver. But I wonder whether it's intentional that the old
> walreceiver dies in the first place. That FATAL exit looks suspiciously
> like it wasn't originally-designed-in behavior.

It's quite intentional afaik - I've complained about the bad error
message recently (we really shouldn't say "no COPY in progress), but
exiting seems quite reasonable. Otherwise we'd have add a separate
retry logic into the walsender, that reconnects without a new walsender
being started.

- Andres

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message nb 2017-06-26 16:57:11 pg_basebackup fails on Windows when using tablespace mapping
Previous Message Alexander Korotkov 2017-06-26 16:44:44 Re: GSoC 2017: Foreign Key Arrays