Re: [PATCH] Fix ProcKill lock-group vs procLatch recycle race

From: Vlad Lesin <vladlesin(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PATCH] Fix ProcKill lock-group vs procLatch recycle race
Date: 2026-05-15 12:29:30
Message-ID: d0b9baa2-7445-40f6-b994-c36125e70b0d@gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello Michael, thank you for your review.

I am sending a new series of patches that incorporate your code review
notes. The version prefix is "vVL2-" to distinguish Andrey's patches
from mine.

On 5/11/26 04:58, Michael Paquier wrote:
> On Thu, May 07, 2026 at 04:57:28PM +0500, Andrey Borodin wrote:
>> Done so. I still keep Vlad's injection_point_condition.h though.
>> It seems useful. But no more Meson\Makefile changes, and new sql stuff lives
>> now near removable_cutoff() SQL.
>
> Because you want to have your callbacks in a new regress_prockill.c
> while being able to reuse the structures. I don't see why not on
> clarity ground. The creation of injection_point_condition.h could be
> done in its own commit, independent of the rest, but that's just me
> being pedantic about such matters. I think that this should be
> backpatched anyway, that could make the introduction of new tests
> easier.

I moved the creation of injection_point_condition.h into a separate
commit in 0001 patch. Additionally, I noticed that if several processes
are waiting on the same injection point, only one of them will be
awakened by a single injection_points_wakeup() call. I am not sure if
this behavior is intentional; please let me know.

For context, MySQL has a similar feature called debug sync points [0],
which allows waking up several threads waiting on them. I suppose this
same semantics would be useful for injection points as well to simplify
tests. I implemented this behavior in the 0002 patch and utilized it in
0003. If you disagree with this approach, I can rewrite the test to use
two separate injection points for the lock group leader and follower.

> +# Two backends form a lock group, then are terminated concurrently. The test
> +# uses injection points placed inside ProcKill() to pause both victims there
> +# and verify that a freshly forked backend can claim the recycled PGPROC slot
> +# without hitting "latch already owned by PID ..." PANIC.
>
> This claim is slightly incorrect here. If I undo the fix, place the
> two INJECTION_POINT macros and re-run the test, the test does not
> produce a PANIC. In order to reproduce a PANIC, it seems that we
> would need something more complicated with an extra point after a
> DisownLatch(). That is not required to me, still this is misleading
> and could be improved, I guess..

Yes, you are right, I fixed the race in 0003 patch with additional
injection point.

> I think that we should refactor the patch into two pieces,
> for clarity:
> - Patch 1 refactors the freelist manipulations and introduces the two
> new boolean flags, documenting more precily how the freelists are
> manipulated and why things are done this way (leader, self, etc.).
> - Patch 2, which is about the earlier DisownLatch(), then becomes a
> fix of its own. That's the primary fix we care about, to avoid the
> entries to be recycled due to a too-early PGPROC entry marked as
> available.

Done. I have split this into two separate patches, 0004 and 0005. The
test in 0003 fails on patch 0004 and passes on 0005.

> Also, I suspect that the test is racy? For one, I strongly suspect
> that you are going to need tweaks similar 011_lock_stats.pl in
> test_misc where we rely on some query_until() with psql banners, to
> make sure that the waits actually happen.. I got the point about
> pg_stat_activity not being something we can rely on here. That's
> something that would be easier to see on an overloaded machine.

The test should no longer be racy following the changes in 0003.

--
Best regards,
Vlad

[0] https://dev.mysql.com/doc/dev/mysql-server/9.6.0/PAGE_DEBUG_SYNC.html

Attachment Content-Type Size
vVL2-0001-injection_points-extract-condition-header.patch text/x-patch 4.1 KB
vVL2-0002-injection_points-wake-every-matching-waiter.patch text/x-patch 3.1 KB
vVL2-0003-add-prockill-lockgroup-regression-test.patch text/x-patch 18.8 KB
vVL2-0004-fix-lockgroup-double-push-and-leak.patch text/x-patch 8.1 KB
vVL2-0005-fix-prockill-lockgroup-procLatch-recycle-race.patch text/x-patch 4.6 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Xuneng Zhou 2026-05-15 12:53:50 Re: Bound memory usage during manual slot sync retries
Previous Message Zsolt Parragi 2026-05-15 11:56:46 Re: COPY ON_CONFLICT TABLE; save duplicated record to another table.