pgsql: bufmgr: Implement buffer content locks independently of lwlocks

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-committers(at)lists(dot)postgresql(dot)org
Subject: pgsql: bufmgr: Implement buffer content locks independently of lwlocks
Date: 2026-01-15 19:28:47
Message-ID: E1vgT1S-000fPa-35@gemulon.postgresql.org
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-committers

bufmgr: Implement buffer content locks independently of lwlocks

Until now buffer content locks were implemented using lwlocks. That has the
obvious advantage of not needing a separate efficient implementation of
locks. However, the time for a dedicated buffer content lock implementation
has come:

1) Hint bits are currently set while holding only a share lock. This leads to
having to copy pages while they are being written out if checksums are
enabled, which is not cheap. We would like to add AIO writes, however once
many buffers can be written out at the same time, it gets a lot more
expensive to copy them, particularly because that copy needs to reside in
shared buffers (for worker mode to have access to the buffer).

In addition, modifying buffers while they are being written out can cause
issues with unbuffered/direct-IO, as some filesystems (like btrfs) do not
like that, due to filesystem internal checksums getting corrupted.

The solution to this is to require a new share-exclusive lock-level to set
hint bits and to write out buffers, making those operations mutually
exclusive. We could introduce such a lock-level into the generic lwlock
implementation, however it does not look like there would be other users,
and it does add some overhead into important code paths.

2) For AIO writes we need to be able to race-freely check whether a buffer is
undergoing IO and whether an exclusive lock on the page can be acquired. That
is rather hard to do efficiently when the buffer state and the lock state
are separate atomic variables. This is a major hindrance to allowing writes
to be done asynchronously.

3) Buffer locks are by far the most frequently taken locks. Optimizing them
specifically for their use case is worth the effort. E.g. by merging
content locks into buffer locks we will be able to release a buffer lock
and pin in one atomic operation.

4) There are more complicated optimizations, like long-lived "super pinned &
locked" pages, that cannot realistically be implemented with the generic
lwlock implementation.

Therefore implement content locks inside bufmgr.c. The lockstate is stored as
part of BufferDesc.state. The implementation of buffer content locks is fairly
similar to lwlocks, with a few important differences:

1) An additional lock-level share-exclusive has been added. This lock-level
conflicts with exclusive locks and itself, but not share locks.

2) Error recovery for content locks is implemented as part of the already
existing private-refcount tracking mechanism in combination with resowners,
instead of a bespoke mechanism as the case for lwlocks. This means we do
not need to add dedicated error-recovery code paths to release all content
locks (like done with LWLockReleaseAll() for lwlocks).

3) The lock state is embedded in BufferDesc.state instead of having its own
struct.

4) The wakeup logic is a tad more complicated due to needing to support the
additional lock-level

This commit unfortunately introduces some code that is very similar to the
code in lwlock.c, however the code is not equivalent enough to easily merge
it. The future wins that this commit makes possible seem worth the cost.

As of this commit nothing uses the new share-exclusive lock mode. It will be
used in a future commit. It seemed too complicated to introduce the lock-level
in a separate commit.

It's worth calling out one wart in this commit: Despite content locks not
being lwlocks anymore, they continue to use PGPROC->lw* - that seemed better
than duplicating the relevant infrastructure.

Another thing worth pointing out is that, after this change, content locks are
not reported as LWLock wait events anymore, but as new wait events in the
"Buffer" wait event class (see also 6c5c393b740). The old BufferContent lwlock
tranche has been removed.

Reviewed-by: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Reviewed-by: Heikki Linnakangas <heikki(dot)linnakangas(at)iki(dot)fi>
Reviewed-by: Greg Burd <greg(at)burd(dot)me>
Reviewed-by: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/fcb9c977aa5f1eefe7444e423e833ff64a5d1d8f

Modified Files
--------------
src/backend/storage/buffer/buf_init.c | 5 +-
src/backend/storage/buffer/bufmgr.c | 913 ++++++++++++++++++++++--
src/backend/utils/activity/wait_event_names.txt | 4 +-
src/include/storage/buf_internals.h | 73 +-
src/include/storage/bufmgr.h | 32 +-
src/include/storage/lwlocklist.h | 1 -
src/include/storage/proc.h | 8 +-
7 files changed, 931 insertions(+), 105 deletions(-)

Browse pgsql-committers by date

  From Date Subject
Next Message Andres Freund 2026-01-15 19:59:29 pgsql: pgindent fix for 8077649907d
Previous Message Tom Lane 2026-01-15 19:12:26 pgsql: Optimize LISTEN/NOTIFY via shared channel map and direct advance