Buffer locking is special (hints, checksums, AIO writes)

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org, Melanie Plageman <melanieplageman(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Noah Misch <noah(at)leadboat(dot)com>
Subject: Buffer locking is special (hints, checksums, AIO writes)
Date: 2025-08-22 19:44:48
Message-ID: fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
Lists: pgsql-hackers

Hi,

I'm working on making bufmgr.c ready for AIO writes. That requires some
infrastructure changes that I wanted to discuss... In this email I'm trying to
outline the problems and what I currently think we should do.

== Problem 1 - hint bits ==

Currently, pages that are being written out are copied if checksums are
enabled. The copy is needed because hint bits can be set while holding just a
share lock, and writing out a buffer also only requires a share lock. If we
didn't copy the buffer, a hint bit set concurrently with the write could
falsify the computed checksum.
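
To make that concrete, here is a minimal sketch of the copy-then-checksum
step (the function name is invented; the real logic lives in bufmgr.c and
bufpage.c):

    #include "postgres.h"

    #include "storage/bufpage.h"
    #include "storage/checksum.h"

    /*
     * The checksum is computed over a private copy of the page, so a hint
     * bit set under a share lock after the memcpy() can no longer falsify
     * the checksum that goes to disk.
     */
    static char pagecopy[BLCKSZ];

    static char *
    checksum_copy_sketch(Page page, BlockNumber blkno)
    {
        memcpy(pagecopy, page, BLCKSZ);
        ((PageHeader) pagecopy)->pd_checksum = pg_checksum_page(pagecopy, blkno);

        return pagecopy;    /* the caller writes out the copy, not the original */
    }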

Even without checksums, setting hint bits while IO is ongoing is an issue,
e.g. btrfs assumes that pages do not change while being written out with
direct IO, and corrupts itself if they do [1].

The copies we make to avoid checksum failures are not at all cheap [2][3], and
they are particularly problematic for AIO, where a lot of buffers can undergo
IO at the same time. For AIO the issue is that the longer-lived bounce buffers
I had initially prototyped end up using a lot of memory, making it hard to
figure out what defaults to set.

In [2] I had worked on an approach to avoid copying pages during writes. It
avoided the need to copy buffers by not allowing hint bits to be set while IO
is ongoing. It did so by skipping hint bit writes while IO is ongoing (plus a
fair bit of infrastructure to make that cheap enough).

In that thread Heikki wondered whether we shouldn't instead go for a more
fundamental solution to the problem, like introducing a separate lock level
that prevents concurrent modifications but allows share lockers. At the time
I was against that, because it seemed like a large project.

However, since then I realized there's an architecturally related second issue:

== Problem 2 - AIO writes vs exclusive locks ==

Separate from the hint bit issue, there is a second issue that I didn't have a
good answer for: Making acquiring an exclusive lock concurrency safe in the
presence of asynchronous writes:

The problem is that while a buffer is being written out, it obviously has to
be share locked. That's true even with AIO, where the share lock is held until
the IO has completed. The problem is that if a backend wants to exclusively
lock a buffer undergoing write IO, it can't just wait for the content lock as
it does today; it might have to actually reap the IO completion from the
operating system. If one were to just wait for the content lock, there would
be no forward progress guarantee.

The buffer's state "knows" that it's undergoing write IO (BM_VALID and
BM_IO_IN_PROGRESS are set). To guarantee forward progress, an exclusive
locker needs to wait for the IO (pgaio_wref_wait(&BufferDesc->io_wref)). The
problem is that it's surprisingly hard to do so race-free:

If a backend A were to just check if a buffer is undergoing IO before locking
it, a backend B could start IO on the buffer between A checking for
BM_IO_IN_PROGRESS and acquiring the content lock. We obviously can't just
hold the buffer header spinlock across a blocking lwlock acquisition.
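
To illustrate why, here is roughly what the naive approach would look like,
and where it breaks (condensed; error handling elided):

    #include "postgres.h"

    #include "storage/aio.h"
    #include "storage/buf_internals.h"

    static void
    LockBufferExclusive_racy_sketch(BufferDesc *buf)
    {
        /* check-then-lock is not atomic ... */
        if (pg_atomic_read_u32(&buf->state) & BM_IO_IN_PROGRESS)
            pgaio_wref_wait(&buf->io_wref);     /* reap the write completion */

        /* <-- another backend can start a new write IO right here --> */

        LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_EXCLUSIVE);

        /* may now sleep on a share lock that only IO completion will release */
    }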

There potentially are ways to synchronize the buffer state and the content
lock, but it requires deep integration between bufmgr.c and lwlock.c.

== Problem 3 - Cacheline contention ==

This is unrelated to AIO, but might influence the architecture for potential
solutions. It might make sense to skip over this section on a quicker
read-through.

The problem is that in some workloads the cacheline containing the BufferDesc
becomes very hot. The most common case is workloads with lots of index nested
loops, where the root page and some of the other inner pages of the index can
become very contended.

I've seen cases where running the same workload in four separate copies in the
same postgres instance yields ~8x the throughput - and that was on a machine
with forty cores; there are machines with many more these days [4].

There are four main issues:

a) We manipulate the same cache line multiple times. E.g. btree always first
pins the page and then locks it, which performs two atomic operations on the
same cacheline (sketched at the end of this item).

I've experimented with putting the content lock on a separate cacheline,
but that causes regressions in other cases.

Leaving nontrivial implementation issues aside, we could release the lock
and pin at the same time. With somewhat more difficulty we could do the
same for pin and lock acquisition. nbtree almost always does the two
together, so it'd not be hard to make it benefit from such an optimization.
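
For reference, the pattern in question, condensed to its essence (real
bufmgr.h API, just abbreviated):

    #include "postgres.h"

    #include "storage/bufmgr.h"

    static Buffer
    read_and_lock_sketch(Relation rel, BlockNumber blkno)
    {
        Buffer      buf = ReadBuffer(rel, blkno);   /* pin: atomic op on BufferDesc */

        LockBuffer(buf, BUFFER_LOCK_SHARE);         /* lock: atomic op on the content
                                                     * lock, in the same cacheline */
        return buf;
    }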

b) Many operations, like unpinning a buffer, need to use CAS, instead of an
atomic subtraction.

Atomic add/sub scales a *lot* better than compare-and-swap, as there is no
need to retry. With increased contention the number of retries increases
further and further.

The reason we need to use CAS in so many places is the following:

Historically, postgres' spinlocks don't use an atomic operation to release
the spinlock on x86. When making buffer header locks their own thing, I
carried that forward to avoid performance regressions. Because of that,
BufferDesc.state may not be concurrently modified while the buffer header
spinlock is held - which is incompatible with unpinning via atomic-sub.

However, since then we converted most of the "hot" uses of buffer header
spinlocks into CAS loops (see e.g. PinBuffer()). That makes it feasible to
use an atomic operation for the buffer header lock release, which in turn
allows e.g. unpinning with an atomic sub (see the sketch below).
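
Condensed sketch of the difference, as it would look inside bufmgr.c (usage
count maintenance, asserts etc. elided; WaitBufHdrUnlocked() is private to
bufmgr.c):

    /* roughly what UnpinBuffer() has to do today: */
    static void
    unpin_cas_sketch(BufferDesc *buf)
    {
        uint32      old_state = pg_atomic_read_u32(&buf->state);

        for (;;)
        {
            if (old_state & BM_LOCKED)
            {
                /* spin until the buffer header lock is released */
                old_state = WaitBufHdrUnlocked(buf);
                continue;
            }

            /* retries whenever the state word changed underneath us */
            if (pg_atomic_compare_exchange_u32(&buf->state, &old_state,
                                               old_state - BUF_REFCOUNT_ONE))
                break;
        }
    }

    /* what becomes possible once the header lock release is atomic and
     * concurrent modification of the state is allowed: */
    static void
    unpin_sub_sketch(BufferDesc *buf)
    {
        pg_atomic_fetch_sub_u32(&buf->state, BUF_REFCOUNT_ONE);
    }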

c) Read accesses to the BufferDesc cause contention

Some code, like nbtree, relies on functions like
BufferGetBlockNumber(). Unfortunately that contends with concurrent
modifications of the buffer descriptor (like pinning). Potential solutions
are to rely less on functions like BufferGetBlockNumber(), or to split the
relevant data out into a separate (denser?) array, as sketched below.
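
A purely hypothetical sketch of the latter option (BufferBlockNumbers is an
invented name; how it would be maintained on buffer replacement is glossed
over):

    #include "postgres.h"

    #include "storage/bufmgr.h"

    /* hypothetical dense, read-mostly array indexed by buffer id */
    extern BlockNumber *BufferBlockNumbers;

    static inline BlockNumber
    BufferGetBlockNumber_sketch(Buffer buffer)
    {
        Assert(BufferIsValid(buffer));

        /* no access to the contended BufferDesc cacheline */
        return BufferBlockNumbers[buffer - 1];
    }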

d) Even after addressing all of the above, there's still a lot of contention

I think the solution here would be something roughly akin to fastpath locks.
If a buffer is very contended, we can mark it as super-pinned & share locked,
avoiding any atomic operations on the buffer descriptor itself. Instead the
current lock and pin count would be stored in each backend's PGPROC (a
hypothetical sketch follows below). Obviously evicting or exclusively locking
such a buffer would be a lot more expensive.

I've prototyped it and it helps a *lot*. The reason I mention this here is
that this seems impossible to do while using the generic lwlocks for the
content lock.
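
A hypothetical sketch of the per-backend state (all names invented):

    #include "postgres.h"

    #include "storage/buf.h"

    /* a handful of these per backend, stored in (or next to) its PGPROC */
    typedef struct FastPathBufferSlot
    {
        Buffer      buffer;         /* which super-pinned buffer this slot covers */
        uint32      pincount;       /* pins held locally, not in BufferDesc.state */
        bool        share_locked;   /* share lock held locally, ditto */
    } FastPathBufferSlot;

As with fastpath locks, a conflicting operation (eviction, exclusive lock)
would have to visit every backend's slots, which is why this can only pay off
for very hot buffers.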

== Solution ? ==

My conclusion from the above is that we ought to:

A) Make Buffer Locks something separate from lwlocks

As part of that, introduce a new lock level in-between share and exclusive
locks (provisionally called share-exclusive, but I hate the name). The new
lock level allows other share lockers, but can only be held by one backend at
a time.

This allows changing the rules so that:

1) Share lockers are not allowed to modify buffers anymore
2) Hint bits need the new lock mode (conditionally upgrading the lock in SetHintBits(); see the sketch below)
3) Write IO requires the new lock level

This addresses Problem 1 from above.
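
A sketch of what SetHintBits() could do under these rules (the signature is
simplified, and ConditionalLockBufferShareExclusive() /
DowngradeBufferLockToShare() are invented names for the upgrade/downgrade
primitives):

    #include "postgres.h"

    #include "access/htup_details.h"
    #include "storage/bufmgr.h"

    /* invented primitives for the new lock level */
    extern bool ConditionalLockBufferShareExclusive(Buffer buffer);
    extern void DowngradeBufferLockToShare(Buffer buffer);

    static void
    SetHintBits_sketch(HeapTupleHeader tuple, Buffer buffer, uint16 infomask)
    {
        /* a plain share lock no longer permits modifying the page */
        if (!ConditionalLockBufferShareExclusive(buffer))
            return;             /* write IO (or another upgrader) in the way; hint
                                 * bits are optional, so just skip setting them */

        tuple->t_infomask |= infomask;
        MarkBufferDirtyHint(buffer, true);

        DowngradeBufferLockToShare(buffer);
    }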

B) Merge BufferDesc.state and the content lock

This allows addressing Problem 2 from above, as we can now atomically check
whether IO was concurrently initiated.

Obviously BufferDesc.state is not currently wide enough, therefore the
buffer state has to be widened to a 64-bit variable.
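
A very rough sketch of exclusive lock acquisition with the merged, widened
state (BUF_LOCK_MASK / BUF_LOCK_EXCLUSIVE are invented bit definitions; wait
queue handling is elided):

    #include "postgres.h"

    #include "storage/aio.h"
    #include "storage/buf_internals.h"

    /* assumes BufferDesc.state has been widened to pg_atomic_uint64 */
    #define BUF_LOCK_MASK       UINT64CONST(0x0003000000000000) /* invented */
    #define BUF_LOCK_EXCLUSIVE  UINT64CONST(0x0001000000000000) /* invented */

    static void
    BufferLockExclusive_sketch(BufferDesc *buf)
    {
        uint64      old_state = pg_atomic_read_u64(&buf->state);

        for (;;)
        {
            if (old_state & BM_IO_IN_PROGRESS)
            {
                pgaio_wref_wait(&buf->io_wref);     /* reap the write */
                old_state = pg_atomic_read_u64(&buf->state);
                continue;
            }

            if (old_state & BUF_LOCK_MASK)
            {
                /* lock held by someone: sleep on a wait queue, then retry (elided) */
                old_state = pg_atomic_read_u64(&buf->state);
                continue;
            }

            /*
             * Because the lock bits and BM_IO_IN_PROGRESS live in the same
             * word, this CAS fails if IO was started concurrently, closing
             * the race described in Problem 2.
             */
            if (pg_atomic_compare_exchange_u64(&buf->state, &old_state,
                                               old_state | BUF_LOCK_EXCLUSIVE))
                break;
        }
    }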

C) Allow some modifications of BufferDesc.state while holding spinlock

Just naively doing the above two things reduces scalability, as the
likelihood of CAS failures increases due to the larger number of
modifications of the same atomic variable.

However, by allowing unpinning while the buffer header spinlock is held,
scalability considerably improves in my tests.

Doing so requires changing all uses of LockBufHdr(), but by introducing a
static inline helper the complexity can largely be encapsulated (sketched
below).
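
A sketch of the kind of helper I mean (name invented, simplified):

    /*
     * Release the buffer header lock and apply the holder's intended changes
     * with an atomic operation, instead of a plain store of a locally
     * computed state.  A concurrent atomic-sub unpin then can't be lost.
     */
    static inline void
    UnlockBufHdrExt_sketch(BufferDesc *buf, uint32 set_bits, uint32 clear_bits)
    {
        uint32      old_state = pg_atomic_read_u32(&buf->state);

        for (;;)
        {
            uint32      new_state;

            new_state = (old_state | set_bits) & ~(clear_bits | BM_LOCKED);

            /* only retries if the word changed concurrently, e.g. due to an
             * atomic-sub unpin by another backend */
            if (pg_atomic_compare_exchange_u32(&buf->state, &old_state, new_state))
                break;
        }
    }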

I've prototyped the above. The current state is pretty rough, but before I
spend the non-trivial time to make it into an understandable sequence of
changes, I wanted to get feedback.

Does this plan sound reasonable?

The hardest part about this change is that everything kind of depends on each
other. The changes are large enough that they clearly can't just be committed
at once, but doing them over time risks [temporary] performance regressions.

The order of changes I think makes the most sense is the following:

1) Allow some modifications while holding the buffer header spinlock

2) Release buffer pins with just an atomic-sub

This needs to happen first, otherwise there are performance regressions
during the later steps.

3) Widen BufferDesc.state to 64 bits

4) Implement buffer locking inside BufferDesc.state

5) Do IO while holding the share-exclusive lock and require all buffer
modifications to at least hold a share-exclusive lock

6) Wait for AIO when acquiring an exclusive content lock

(some of these will likely have parts of their own, but that's details)

Sane?

DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/CA%2BhUKGKSBaz78Fw3WTF3Q8ArqKCz1GgsTfRFiDPbu-j9OFz-jw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf%40gcnactj4z56m
[3] In some cases it causes slowdowns for checkpointer close to 50%!
[4] https://anarazel.de/talks/2024-10-23-pgconf-eu-numa-vs-postgresql/numa-vs-postgresql.pdf - slide 19
