From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: YA race condition in 001_stream_rep.pl (was Re: pgsql: Allow using the updated tuple while moving it to a different par)
Date: 2018-07-14 18:16:45
Message-ID: 1922.1531592205@sss.pgh.pa.us
Lists: pgsql-committers pgsql-hackers

I wrote:
> This looks to me like it's probably a low-probability timing problem in
> the test script itself.
> ...
> We could maybe hack around that with a short "sleep" in this script, but
> a less cruddy solution would be to change PostgresNode::reload so that
> it somehow waits for the server to process the reload signal.

On further reflection, I doubt that would help. The issue isn't whether
the standby_2 postmaster has processed the SIGHUP, but whether its
walreceiver process has sent a feedback message. The loop in
WalReceiverMain does things in this order:

1. Check for SIGTERM, exit if seen.

2. Check for SIGHUP, process that (and possibly send a feedback message)
if seen.

3. Read and process streaming data till no more is available.

We've seen just a tiny number of failures at this point in the test
--- scraping the buildfarm database, I find these:

sysname | snapshot | stage | l
--------------+---------------------+---------------+------------------------------------------------------------------------------
longfin | 2017-09-26 14:39:05 | recoveryCheck | Timed out waiting for slot xmins to advance at t/001_stream_rep.pl line 155.
flaviventris | 2018-04-07 17:02:01 | recoveryCheck | Timed out waiting for slot xmins to advance at t/001_stream_rep.pl line 155.
serinus | 2018-05-18 20:14:01 | recoveryCheck | Timed out waiting for slot xmins to advance at t/001_stream_rep.pl line 155.
idiacanthus | 2018-06-23 12:46:01 | recoveryCheck | Timed out waiting for slot xmins to advance at t/001_stream_rep.pl line 157.
francolin | 2018-06-29 13:30:01 | recoveryCheck | Timed out waiting for slot xmins to advance at t/001_stream_rep.pl line 157.
flaviventris | 2018-07-12 07:36:01 | recoveryCheck | Timed out waiting for slot xmins to advance at t/001_stream_rep.pl line 157.
(6 rows)

So it's clearly a very low-probability race condition. My hypothesis
is that the SIGHUP arrives, but the walreceiver process is already in
the read-and-process-data part of its loop, and the timing is such that
it manages to stay there until we issue the standby_2 shutdown command.
When it finally notices that it has no more data available, it iterates
the outermost loop, and then shuts down without ever having sent a
feedback message. The "replay_check()" call in the test script doesn't
ensure any synchronization here, since it only verifies that WAL data
has been applied, not that any feedback has happened.

I can reproduce the failure pretty reliably with a hack like the one
attached, which makes it unlikely that the walreceiver will send a
feedback message instantly when it gets the SIGHUP.

So the issue boils down to this: the test script is, effectively,
assuming that it's guaranteed that the walreceiver will send a feedback
message before it shuts down; but there is no such guarantee. Is this
a bug in the test script, or a bug in the walreceiver logic? I can
see the value of having such a guarantee, but I also think it would be
nigh impossible to make it a bulletproof guarantee. We could perhaps
add "XLogWalRcvSendHSFeedback(true)" somewhere closer to process exit,
but that might add more failure modes than it removes.

Or we could change the test script to wait for feedback before it
issues the shutdown, but then I think the test is a bit pointless.

Comments?

regards, tom lane

Attachment Content-Type Size
delay-hs-feedback.patch text/x-diff 1.0 KB
