Re: Race condition in InvalidateObsoleteReplicationSlots()

From: Álvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Fujii Masao <fujii(at)postgresql(dot)org>
Subject: Re: Race condition in InvalidateObsoleteReplicationSlots()
Date: 2021-06-11 16:27:57
Message-ID: 202106111627.t4yrl54wtpsf@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2021-Jun-10, Álvaro Herrera wrote:

> Here's a version that I feel is committable (0001). There was an issue
> when returning from the inner loop, if in a previous iteration we had
> released the lock. In that case we need to return with the lock not
> held, so that the caller can acquire it again, but weren't. This seems
> pretty hard to hit in practice (I suppose somebody needs to destroy the
> slot just as checkpointer killed the walsender, but before checkpointer
> marks it as its own) ... but if it did happen, I think checkpointer
> would block with no recourse. Also added some comments and slightly
> restructured the code.
>
> There are plenty of conflicts in pg13, but it's all easy to handle.

Pushed, with additional minor changes.

> I wrote a test (0002) to cover the case of signalling a walsender, which
> is currently not covered (we only deal with the case of a standby that's
> not running). There are some sharp edges in this code -- I had to make
> it use background_psql() to send a CHECKPOINT, which hangs, because I
> previously send a SIGSTOP to the walreceiver. Maybe there's a better
> way to achieve a walreceiver that remains connected but doesn't consume
> input from the primary, but I don't know what it is. Anyway, the code
> becomes covered with this. I would like to at least see it in master,
> to gather some reactions from buildfarm.

I tried hard to make this stable, but it just isn't (it works fine one
thousand runs, then I grab some coffee and run it once more and that one
fails. Why? that's not clear to me). Attached is the last one I have,
in case somebody wants to make it better. Maybe there's some completely
different approach that works better, but I'm out of ideas for now.

--
Álvaro Herrera Valdivia, Chile
"La experiencia nos dice que el hombre peló millones de veces las patatas,
pero era forzoso admitir la posibilidad de que en un caso entre millones,
las patatas pelarían al hombre" (Ijon Tichy)

Attachment Content-Type Size
v3-0001-Add-test-case-for-invalidating-an-active-replicat.patch text/x-diff 4.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2021-06-11 16:34:38 Re: "an SQL" vs. "a SQL"
Previous Message Fujii Masao 2021-06-11 16:25:21 Re: Transactions involving multiple postgres foreign servers, take 2