Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

From: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers(at)lists(dot)postgresql(dot)org, exclusion(at)gmail(dot)com
Subject: Re: Fix race condition in InvalidatePossiblyObsoleteSlot()
Date: 2024-03-06 17:45:56
Message-ID: Zeir1JpVsfdb7/nb@ip-10-97-1-34.eu-west-3.compute.internal
Lists: pgsql-hackers

Hi,

On Wed, Mar 06, 2024 at 05:45:56PM +0530, Bharath Rupireddy wrote:
> On Wed, Mar 6, 2024 at 4:51 PM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> >
> > On Wed, Mar 06, 2024 at 09:17:58AM +0000, Bertrand Drouvot wrote:
> > > Right, somehow out of context here.
> >
> > We're not yet in the green yet, one of my animals has complained:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hachi&dt=2024-03-06%2010%3A10%3A03
> >
> > This is telling us that the xmin horizon is unchanged, and the test
> > cannot move on with the injection point wake up that would trigger the
> > following logs:
> > 2024-03-06 20:12:59.039 JST [21143] LOG: invalidating obsolete replication slot "injection_activeslot"
> > 2024-03-06 20:12:59.039 JST [21143] DETAIL: The slot conflicted with xid horizon 770.
> >
> > Not sure what to think about that yet.
>
> Windows - Server 2019, VS 2019 - Meson & ninja on my CI setup isn't
> happy about that as well [1]. It looks like the slot's catalog_xmin on
> the standby isn't moving forward.
>

Thank you both for the report! I ran a few tests manually and can see the issue
from time to time. When it occurs, the logical decoding was able to get past
the place where LogicalConfirmReceivedLocation() updates the slot's
catalog_xmin before being killed. In that case I can see that the
catalog_xmin has been updated to the xid horizon.

That means that in a failed test we have something like:

slot's catalog_xmin: 839 and "The slot conflicted with xid horizon 839."

Whereas when the test passes, you'll see something like:

slot's catalog_xmin: 841 and "The slot conflicted with xid horizon 842."

In the failing test, the call to SELECT pg_logical_slot_get_changes() no
longer advances the slot's catalog_xmin.
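
For illustration only, the two cases above can be told apart by comparing the
slot's catalog_xmin with the horizon reported in the DETAIL log line. This is
a rough Python sketch, not part of the test: the function names are made up,
and it compares xids as plain integers, so it ignores xid wraparound.

```python
import re

# Matches the DETAIL line emitted when a slot is invalidated, e.g.
# "DETAIL: The slot conflicted with xid horizon 839."
HORIZON_RE = re.compile(r"The slot conflicted with xid horizon (\d+)\.")


def horizon_from_log(detail_line: str) -> int:
    """Extract the xid horizon from a DETAIL log line (hypothetical helper)."""
    match = HORIZON_RE.search(detail_line)
    if match is None:
        raise ValueError("no xid horizon found in log line")
    return int(match.group(1))


def xmin_already_at_horizon(catalog_xmin: int, detail_line: str) -> bool:
    """Return True when catalog_xmin has already reached the logged horizon.

    Assumption: xids are compared as plain integers (no wraparound handling).
    """
    return catalog_xmin >= horizon_from_log(detail_line)
```

With the numbers above, the check is true for the failed run (839 and
horizon 839) and false for the passing run (841 and horizon 842).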

To fix this, I think we need a new transaction to decode from the primary before
executing pg_logical_slot_get_changes(). But this transaction has to be replayed
on the standby first by the startup process, which means we need to wake up
"terminate-process-holding-slot", and we probably need another injection
point somewhere in this test.

I'll look at it, unless you have another idea?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
