Quick Links

Re: Another reason why the recovery tests take a long time

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgreSQL(dot)org
Subject:	Re: Another reason why the recovery tests take a long time
Date:	2017-06-26 16:46:59
Message-ID:	20170626164659.ea43donv7tqmwl5a@alap3.anarazel.de
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

On 2017-06-26 12:32:00 -0400, Tom Lane wrote:
> I've found another edge-case bug through investigation of unexpectedly
> slow recovery test runs. It goes like this:
>
> * While streaming from master to slave, test script shuts down master
> while slave is left running. We soon restart the master, but meanwhile:
>
> * slave's walreceiver process fails, reporting
>
> 2017-06-26 16:06:50.209 UTC [13209] LOG: replication terminated by primary server
> 2017-06-26 16:06:50.209 UTC [13209] DETAIL: End of WAL reached on timeline 1 at 0/3000098.
> 2017-06-26 16:06:50.209 UTC [13209] FATAL: could not send end-of-streaming message to primary: no COPY in progress
>
> * slave's startup process observes that walreceiver is gone and sends
> PMSIGNAL_START_WALRECEIVER to ask for a new one
>
> * more often than you would guess, in fact nearly 100% reproducibly for
> me, the postmaster receives/services the PMSIGNAL before it receives
> SIGCHLD for the walreceiver. In this situation sigusr1_handler just
> throws away the walreceiver start request, reasoning that the walreceiver
> is already running.

Yuck.

I've recently seen a bunch of symptoms vaguely around this, e.g. I can
atm frequently reconnect to a new backend after a
PANIC/segfault/whatnot, before postmastre gets the message. I've not
yet figured out whether that's a kernel change, or whether some of the
more recent tinkering in postmaster.c changed this.

> * eventually, it dawns on the startup process that the walreceiver
> isn't starting, and it asks for a new one. But that takes ten seconds
> (WALRCV_STARTUP_TIMEOUT).
>
> So this looks like a pretty obvious race condition in the postmaster,
> which should be resolved by having it set a flag on receipt of
> PMSIGNAL_START_WALRECEIVER that's cleared only when it does start a
> new walreceiver. But I wonder whether it's intentional that the old
> walreceiver dies in the first place. That FATAL exit looks suspiciously
> like it wasn't originally-designed-in behavior.

It's quite intentional afaik - I've complained about the bad error
message recently (we really shouldn't say "no COPY in progress), but
exiting seems quite reasonable. Otherwise we'd have add a separate
retry logic into the walsender, that reconnects without a new walsender
being started.

- Andres

In response to

Another reason why the recovery tests take a long time at 2017-06-26 16:32:00 from Tom Lane

Responses

Re: Another reason why the recovery tests take a long time at 2017-06-26 17:42:52 from Tom Lane

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	nb	2017-06-26 16:57:11	pg_basebackup fails on Windows when using tablespace mapping
Previous Message	Alexander Korotkov	2017-06-26 16:44:44	Re: GSoC 2017: Foreign Key Arrays