Re: Race condition in InvalidateObsoleteReplicationSlots()

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Álvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Fujii Masao <fujii(at)postgresql(dot)org>
Subject: Re: Race condition in InvalidateObsoleteReplicationSlots()
Date: 2021-06-20 17:19:31
Message-ID: 1510400.1624209571@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I wrote:
> Hmm ... desmoxytes has failed this test once, out of four runs since
> it went in:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2021-06-19%2003%3A06%3A04

I studied this failure a bit more, and I think the test itself has
a race condition. It's doing

# freeze walsender and walreceiver. Slot will still be active, but walreceiver
# won't get anything anymore.
kill 'STOP', $senderpid, $receiverpid;
$logstart = get_log_size($node_primary3);
advance_wal($node_primary3, 4);
ok(find_in_log($node_primary3, "to release replication slot", $logstart),
"walreceiver termination logged");

The string it's looking for does show up in node_primary3's log, but
not for another second or so; we can see instances of the following
poll_query_until query before that happens. So the problem is that
there is no interlock to ensure that the walreceiver terminates
before this find_in_log check looks for it.

You should be able to fix this by adding a retry loop around the
find_in_log check (which would likely mean that you don't need
to do multiple advance_wal iterations here).

However, I agree with reverting the test for now and then trying
again after beta2.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2021-06-20 17:24:04 Re: PXGS vs TAP tests
Previous Message Zhihong Yu 2021-06-20 17:06:47 Re: Speed up transaction completion faster after many relations are accessed in a transaction