Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, exclusion(at)gmail(dot)com
Subject: Re: Fix race condition in InvalidatePossiblyObsoleteSlot()
Date: 2024-02-19 08:15:16
Message-ID: CALj2ACUMPTvuk0_GuPiFbUb=PMTQq0WMi5Kvj8Mz0JZ_ico1Lw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Feb 19, 2024 at 11:44 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>
> > Yeah, comments added in v3.
>
> The contents look rather OK, I may do some word-smithing for both.

Here are some comments on v3:

1.
+ XLogRecPtr initial_effective_xmin = InvalidXLogRecPtr;
+ XLogRecPtr initial_catalog_effective_xmin = InvalidXLogRecPtr;
+ XLogRecPtr initial_restart_lsn = InvalidXLogRecPtr;

Prefix 'initial_' makes the variable names a bit longer, I think we
can just use effective_xmin, catalog_effective_xmin and restart_lsn,
the code updating then only when if (!terminated) tells one that they
aren't updated every time.

2.
+ /*
+ * We'll release the slot's mutex soon, so it's possible that
+ * those values change since the process holding the slot has been
+ * terminated (if any), so record them here to ensure we would
+ * report the slot as obsolete correctly.
+ */

This needs a bit more info as to why and how effective_xmin,
catalog_effective_xmin and restart_lsn can move ahead after signaling
a backend and before the signalled backend reports back.

3.
+ /*
+ * Assert that the conflict cause recorded previously before we
+ * terminate the process did not change now for any reason.
+ */
+ Assert(!(conflict_prev != RS_INVAL_NONE && terminated &&
+ conflict_prev != conflict));

It took a while for me to understand the above condition, can we
simplify it like below using De Morgan's laws for better readability?

Assert((conflict_prev == RS_INVAL_NONE) || !terminated ||
(conflict_prev == conflict));

> > Are we going to fix this on master and 16 stable first and then later on add a
> > test on master with the injection points?
>
> Still, the other patch is likely going to take a couple of weeks
> before getting into the tree, so I have no objection to fix the bug
> first and backpatch, then introduce a test. Things have proved that
> failures could show up in the buildfarm in this area, so more time
> running this code across two branches is not a bad concept, either.

While I couldn't agree more on getting this fix in, it's worth pulling
in the required injection points patch here and writing the test to
reproduce this race issue.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bertrand Drouvot 2024-02-19 08:25:14 Re: System username in pg_stat_activity
Previous Message Quan Zongliang 2024-02-19 08:05:03 Re: Improvement discussion of custom and generic plans