| From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
|---|---|
| To: | Vlad Lesin <vladlesin(at)gmail(dot)com> |
| Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: [PATCH] Fix ProcKill lock-group vs procLatch recycle race |
| Date: | 2026-05-05 09:07:17 |
| Message-ID: | 1CB9D745-AA3C-427F-8DD5-2140F1DEF229@yandex-team.ru |
| Lists: | pgsql-hackers |
> On 27 Apr 2026, at 13:14, Vlad Lesin <vladlesin(at)gmail(dot)com> wrote:
>
> Problem
> ------------------------------------------------------------------------
>
> If a leader detaches from the lock group under leader_lwlock but
> has not yet reached DisownLatch(&MyProc->procLatch), a concurrent
> last follower can still put the *leader* PGPROC on a free list, or
> the leader and the follower can make inconsistent decisions about
> *who* returns which PGPROC, so that a slot is linked into the free
> list with procLatch still owned, or is pushed twice. A new backend
> that recycles the slot can then hit:
>
> PANIC: latch already owned by PID ...
>
> A concrete interleaving (lock group leader vs last member)
> is the following (PG15 code).
Yeah, the problem seems real to me. Moreover, we had related buildfarm
failures [0], and Deep from GP reported observing the problem there too.
Yugabyte folks also observed this [1].
The invariant that a PGPROC should not be on the freelist until its procLatch
is disowned seems reasonable to me.
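To make the invariant concrete, here is a minimal toy sketch in C. It is not the PostgreSQL code: ToyProc, toy_freelist_push, toy_freelist_pop and toy_disown_latch are hypothetical names made up for illustration, and there is no locking or shared memory here. It only shows the rule itself: a slot may be linked into the free list only once its latch is disowned, and only once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a proc slot; NOT the real PGPROC layout. */
typedef struct ToyProc
{
    int             latch_owner_pid;   /* 0 means the latch is disowned */
    bool            on_freelist;       /* guards against a double push */
    struct ToyProc *next_free;
} ToyProc;

static ToyProc *toy_freelist = NULL;

/* Hypothetical analogue of disowning the latch. */
static void
toy_disown_latch(ToyProc *proc)
{
    proc->latch_owner_pid = 0;
}

/*
 * Push a slot onto the free list. Returns false, and leaves the list
 * untouched, if the push would break the invariant.
 */
static bool
toy_freelist_push(ToyProc *proc)
{
    if (proc->latch_owner_pid != 0)
        return false;           /* latch still owned: must not be recycled */
    if (proc->on_freelist)
        return false;           /* already linked: would be pushed twice */

    proc->next_free = toy_freelist;
    proc->on_freelist = true;
    toy_freelist = proc;
    return true;
}

/*
 * Pop a slot for a new backend and take ownership of its latch.
 * Returns NULL when the list is empty.
 */
static ToyProc *
toy_freelist_pop(int new_pid)
{
    ToyProc *proc = toy_freelist;

    if (proc == NULL)
        return NULL;

    toy_freelist = proc->next_free;
    proc->next_free = NULL;
    proc->on_freelist = false;

    /* The invariant guarantees the recycled latch is unowned here. */
    assert(proc->latch_owner_pid == 0);
    proc->latch_owner_pid = new_pid;
    return proc;
}
```

In this toy model, a push attempted before toy_disown_latch() is simply refused, so a later pop can never find an owned latch, which is the situation behind the "latch already owned by PID" PANIC quoted above. The real fix has to get the same guarantee across a concurrent leader and follower, which this sketch does not attempt to model.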
But both the test and the fix are presented in a very confusing way. They are
not sequential patch steps, as one might expect from the 0001, 0002, 0003
prefixes, and they are not based on PG 18, as the filenames state.
To help resolve this confusion, I'm posting the following sequence:
1. vAB1-0001-Add-regression-test-for-ProcKill-lock-group-pro.patch
This is the original test, which is expected to demonstrate the problem.
It contains heavy refactoring of injection points, so I assume it's not intended for commit.
This test was taken from the file 0003-PG18-unfixed-repro-tap-injection-harness.patch.
2. vAB1-0002-Canonicalize-test-with-infrastructure.patch
My changes, needed to make the test runnable.
3. vAB1-0003-Fix-ProcKill-lock-group-vs-procLatch-recycle-ra.patch
The fix for the problem, proposed by the thread starter, rebased on current HEAD
and the test patch.
The test passes after this step.
I would recommend that the author make the patch leaner and easier to review.
Best regards, Andrey Borodin.
[0] https://www.postgresql.org/message-id/flat/CA%2BhUKGJ_0RGcr7oUNzcHdn7zHqHSB_wLSd3JyS2YC_DYB%2B-V%3Dg%40mail.gmail.com
[1] https://github.com/yugabyte/yugabyte-db/issues/20309
| Attachment | Content-Type | Size |
|---|---|---|
| vAB1-0001-Add-regression-test-for-ProcKill-lock-group-pro.patch | application/octet-stream | 33.1 KB |
| vAB1-0002-Canonicalize-test-with-infrastructure.patch | application/octet-stream | 3.9 KB |
| vAB1-0003-Fix-ProcKill-lock-group-vs-procLatch-recycle-ra.patch | application/octet-stream | 7.7 KB |