Re: [PATCH] Fix ProcKill lock-group vs procLatch recycle race

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Vlad Lesin <vladlesin(at)gmail(dot)com>
Cc: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PATCH] Fix ProcKill lock-group vs procLatch recycle race
Date: 2026-05-28 01:08:28
Message-ID: aheVjCHmcbXBtiy0@paquier.xyz
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, May 27, 2026 at 02:03:37PM +0300, Vlad Lesin wrote:
> Attached is an updated patch with the test rebased onto the latest master.

I have been looking at that, and after putting my HEAD on it I have
concluded that this test is an anti-pattern based on its current
shape. The test relies on a wait, but the wait mechanism in
injection_points will not be able to work as the latch of the backends
get detached earlier after the fix done at 84b9d6bceab6 to prevent the
recycle race. What your proposed test was relying on here is a
statement timeout to make sure that the leader stays long enough in
the second point so as the followers are awaken and can finish their
termination business first. But we cannot really rely on that, and it
makes the test much longer than necessary on fast machines, dependent
on statement_timeout.

Another thing that we could do is redesign the wait mechanism in
injection_points to not rely on latches, using a shared memory flag
where we don't depend on latches and conditional variables. I cannot
get excited enough about that for this specific case, but perhaps
somebody would like to give it a shot? The use of latches has been
mentioned more than once as an annoying limitation, but I've always
found that less efficient on fast machines as it would require an
internal polling with a sleep or equivalent. This thread would give
an extra reason to do so, perhaps not in the stable branches but HEAD
once v20 opens up?

prockill_attach_injection_wait_pid() is not really necessary. We
could just reuse the existing attach() function and add an optional
PID argument, defaulting to 0.

Using two different points, one for the leader and one for the
follower is indeed the correct way to do things.

Another thing that one may try is to switch the test to use NOTICEs
and server log lookups. That may catch the PANIC, but one really
needs to be lucky so it would be mostly a waste of cycles in the
buildfarm due to the low reproducibility rate.

The test was still a useful exercise to prove the bug, though. If one
reverts 84b9d6bceab6 and runs the test, we are able to reproduce the
original bug.

Anyway, if we are to move ahead with this test, I'd suggest a couple
of patch pieces first:
- Switch the wait mechanism to use something more primitive, with a
shmem state.
- Extend injection_points_attach() with an optional PID.
- Add the test.
All that would be quite invasise, especially for the 1st point, so
introducing that only in v20 may be better. I am not sure that the
case of this thread is good enough to justify these changes, TBH.
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message SATYANARAYANA NARLAPURAM 2026-05-28 01:11:16 Re: [PATCH] Release replication slot on error in SQL-callable slot functions
Previous Message SATYANARAYANA NARLAPURAM 2026-05-28 01:08:21 Re: [PATCH] Release replication slot on error in SQL-callable slot functions