Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, exclusion(at)gmail(dot)com
Subject: Re: Fix race condition in InvalidatePossiblyObsoleteSlot()
Date: 2024-03-07 00:54:24
Message-ID: ZekQQHCrIqLVpGz5@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Mar 06, 2024 at 05:45:56PM +0000, Bertrand Drouvot wrote:
> Thank you both for the report! I did a few test manually and can see the issue
> from times to times. When the issue occurs, the logical decoding was able to
> go through the place where LogicalConfirmReceivedLocation() updates the
> slot's catalog_xmin before being killed. In that case I can see that the
> catalog_xmin is updated to the xid horizon.

Two buildfarm machines have complained here, and one of them twice in
a row. That's quite amazing, because a couple of dozen runs done in a
row on the same host as these animals all pass. The CI did not
complain either (did 2~3 runs there yesterday).

> Means in a failed test we have something like:
>
> slot's catalog_xmin: 839 and "The slot conflicted with xid horizon 839."
>
> While when the test is ok you'll see something like:
>
> slot's catalog_xmin: 841 and "The slot conflicted with xid horizon 842."

Perhaps we should also make the test report the catalog_xmin of the
slot. That may make debugging a bit easier.

> In the failing test the call to SELECT pg_logical_slot_get_changes() does
> not advance the slot's catalog xmin anymore.

Is that something that we could enforce in the test in a stronger way,
cross-checking the xmin horizon before and after the call?

> To fix this, I think we need a new transacion to decode from the primary before
> executing pg_logical_slot_get_changes(). But this transaction has to be replayed
> on the standby first by the startup process. Which means we need to wakeup
> "terminate-process-holding-slot" and that we probably need another injection
> point somewehere in this test.
>
> I'll look at it unless you've another idea?

I am wondering if there is something else lurking here, actually, so
for now I am going to revert the change as it is annoying to get
sporadic failures in the CF bot at this time of the year and there are
a lot of patches under discussion. Let's give it more time and more
thoughts, without pressure.
--
Michael

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Smith 2024-03-07 02:04:51 Re: Synchronizing slots from primary to standby
Previous Message Thomas Munro 2024-03-07 00:37:36 Re: Combine headerscheck and cpluspluscheck scripts