pgsql: Fix race conditions in ProcKill()'s lock-group freelist handling

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: pgsql-committers(at)lists(dot)postgresql(dot)org
Subject: pgsql: Fix race conditions in ProcKill()'s lock-group freelist handling
Date: 2026-05-27 05:53:10
Message-ID: E1wS7CY-001Ink-22@gemulon.postgresql.org
Lists: pgsql-committers

Fix race conditions in ProcKill()'s lock-group freelist handling

This commit fixes two bugs in how ProcKill() returns PGPROCs to the
freelist during lock-group teardown:
* a double push of the leader's PGPROC, which corrupts the freelist.
* a leak of the last follower's PGPROC slot.

ProcKill()'s lock-group teardown had two PGPROC freelist updates
scattered through the function, done under two separate freeProcsLock
acquisitions:
* A follower's push of the leader's PGPROC, done when a follower is the
last group member exiting.
* Every backend's self-push at the bottom of the function.

The two freelist updates were coordinated only by inspecting
proc->lockGroupLeader, which a follower could clear as a side effect of
pushing the leader. This coordination was broken. For example, with
two concurrent backends:
* The follower clears leader->lockGroupLeader and pushes the leader's
PGPROC under leader_lwlock.
* The follower does not clear its own proc->lockGroupLeader, as that
step is skipped.
* When the leader reaches the bottom of ProcKill(), it sees a NULL
proc->lockGroupLeader (the follower cleared it) and pushes itself,
causing a second dlist_push_tail() of the same node onto the same
freelist.
* The follower, at the bottom, sees that its own proc->lockGroupLeader
is still non-NULL (never cleared) and skips its own push, causing its
own slot to leak.
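
The interleaving above can be replayed in a small single-threaded model.
This is only a sketch of the old logic as described here, not the code
in proc.c: the Proc struct, its members and pushes counters, and the
function names are hypothetical, and the freelist is reduced to a
per-slot push count.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for PGPROC; all fields except lockGroupLeader are hypothetical. */
typedef struct Proc
{
    struct Proc *lockGroupLeader;   /* NULL when not in a lock group */
    int          members;           /* leader only: group members still alive */
    int          pushes;            /* times this slot was pushed to the freelist */
} Proc;

static void
form_group(Proc *leader, Proc *follower)
{
    leader->lockGroupLeader = leader;
    leader->members = 2;
    leader->pushes = 0;
    follower->lockGroupLeader = leader;
    follower->members = 0;
    follower->pushes = 0;
}

/* Old group teardown, done under leader_lwlock in the real code. */
static void
old_group_teardown(Proc *self)
{
    Proc *leader = self->lockGroupLeader;

    if (leader == NULL)
        return;
    assert(leader->members > 0);
    leader->members--;
    if (leader->members == 0)
    {
        leader->lockGroupLeader = NULL;
        if (leader != self)
            leader->pushes++;       /* last follower pushes the leader */
        /* note: self->lockGroupLeader is not cleared on this path */
    }
    else if (leader != self)
        self->lockGroupLeader = NULL;
}

/* Old self-push at the bottom of the function. */
static void
old_self_push(Proc *self)
{
    if (self->lockGroupLeader == NULL)
        self->pushes++;
}

/* Replay: the leader leaves the group first, the follower exits last. */
static void
replay_race(Proc *leader, Proc *follower)
{
    form_group(leader, follower);
    old_group_teardown(leader);     /* group not empty yet, nothing pushed */
    old_group_teardown(follower);   /* clears leader's pointer, pushes leader */
    old_self_push(leader);          /* sees NULL: second push of the leader */
    old_self_push(follower);        /* sees non-NULL: skips, slot leaks */
}
```

In this model the leader's slot ends up pushed twice and the follower's
slot is never pushed, matching the two bugs listed at the top.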

This commit refactors the freelist manipulation to be done in two
distinct phases, each using its own lock acquisition, so that each
freelist operation happens in an isolated manner for each backend
(follower or leader):
- First, under a single leader_lwlock acquisition, check the state of
the lock group. Depending on whether the backend is a follower or a
leader, and on whether the leader has exited before a follower, set
state booleans that define which actions should be taken with the
freelist.
- Second, under a single freeProcsLock acquisition, perform the cleanup
actions, self-push of a backend and/or push of the leader back to the
freelist.
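
As a rough illustration of the two phases, here is a toy model in the
same spirit. It is a sketch under assumptions, not the committed code:
the ExitPlan struct, its push_self and push_leader flags, and the
function names are hypothetical, and the freelist is again reduced to a
per-slot push count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for PGPROC; all fields except lockGroupLeader are hypothetical. */
typedef struct Proc
{
    struct Proc *lockGroupLeader;   /* NULL when not in a lock group */
    int          members;           /* leader only: group members still alive */
    int          pushes;            /* times this slot was pushed to the freelist */
} Proc;

/* Decisions taken in phase 1 and applied in phase 2 (hypothetical names). */
typedef struct ExitPlan
{
    Proc *leader;
    bool  push_self;
    bool  push_leader;
} ExitPlan;

static void
form_group(Proc *leader, Proc *follower)
{
    leader->lockGroupLeader = leader;
    leader->members = 2;
    leader->pushes = 0;
    follower->lockGroupLeader = leader;
    follower->members = 0;
    follower->pushes = 0;
}

/* Phase 1: would run under a single leader_lwlock acquisition. */
static ExitPlan
plan_exit(Proc *self)
{
    ExitPlan plan = {self->lockGroupLeader, true, false};
    Proc    *leader = plan.leader;

    if (leader == NULL)
        return plan;                /* not in a group: plain self-push */
    assert(leader->members > 0);
    leader->members--;
    if (leader->members == 0)
    {
        leader->lockGroupLeader = NULL;
        if (leader != self)
            plan.push_leader = true;    /* leader exited first */
    }
    else if (leader == self)
        plan.push_self = false;     /* last follower will push the leader */
    if (leader != self)
        self->lockGroupLeader = NULL;
    return plan;
}

/* Phase 2: would run under a single freeProcsLock acquisition. */
static void
apply_plan(Proc *self, ExitPlan plan)
{
    if (plan.push_leader)
        plan.leader->pushes++;
    if (plan.push_self)
        self->pushes++;
}
```

Because every decision is taken in phase 1, the second phase no longer
depends on a lockGroupLeader value that another backend may have
changed in between; in this model each slot is pushed exactly once
whichever backend exits first.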

This is an old issue, dating back to 9.6, where parallel workers and
lock grouping were added.

Author: Vlad Lesin <vladlesin(at)gmail(dot)com>
Reviewed-by: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
Reviewed-by: Michael Paquier <michael(at)paquier(dot)xyz>
Discussion: https://postgr.es/m/d2983796-2603-41b7-a66e-fc8489ddb954@gmail.com
Backpatch-through: 14

Branch
------
REL_14_STABLE

Details
-------
https://git.postgresql.org/pg/commitdiff/8007d118524a5d43568217341fab58b590930322

Modified Files
--------------
src/backend/storage/lmgr/proc.c | 79 +++++++++++++++++++++++++++--------------
1 file changed, 53 insertions(+), 26 deletions(-)
