Re: Race condition in InvalidateObsoleteReplicationSlots()

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Álvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Fujii Masao <fujii(at)postgresql(dot)org>
Subject: Re: Race condition in InvalidateObsoleteReplicationSlots()
Date: 2021-06-20 18:37:09
Message-ID: 1514140.1624214229@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I wrote:
> I studied this failure a bit more, and I think the test itself has
> a race condition. It's doing
>
> # freeze walsender and walreceiver. Slot will still be active, but walreceiver
> # won't get anything anymore.
> kill 'STOP', $senderpid, $receiverpid;
> $logstart = get_log_size($node_primary3);
> advance_wal($node_primary3, 4);
> ok(find_in_log($node_primary3, "to release replication slot", $logstart),
> "walreceiver termination logged");

Actually ... isn't there a second race, in the opposite direction?
IIUC, the point of this is that once we force some WAL to be sent
to the frozen sender/receiver, they'll be killed for failure to
respond. But the advance_wal call is not the only possible cause
of that; a background autovacuum for example could emit some WAL.
So I fear it's possible for the 'to release replication slot'
message to come out before we capture $logstart. I think you
need to capture that value before the kill not after.

I also suggest that it wouldn't be a bad idea to make the
find_in_log check more specific, by including the expected PID
and perhaps the expected slot name in the string. The full
message in primary3's log looks like

2021-06-19 05:24:36.221 CEST [60cd636f.362648:12] LOG: terminating process 3548959 to release replication slot "rep3"

and I don't understand why we wouldn't match on the whole
message text. (I think doing so will also reveal that what
we are looking for here is the walsender pid, not the walreceiver
pid, and thus that the description in the ok() call is backwards.
Or maybe we do want to check the walreceiver side, in which case
we are searching the wrong postmaster's log?)

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2021-06-20 22:10:29 Reducing the cycle time for CLOBBER_CACHE_ALWAYS buildfarm members
Previous Message Andrey Borodin 2021-06-20 18:15:08 Re: Different compression methods for FPI