Re: Buffer locking is special (hints, checksums, AIO writes)

From: Noah Misch <noah(at)leadboat(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Melanie Plageman <melanieplageman(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: Buffer locking is special (hints, checksums, AIO writes)
Date: 2025-08-27 00:14:49
Message-ID: 20250827001449.fb.nmisch@google.com

On Fri, Aug 22, 2025 at 03:44:48PM -0400, Andres Freund wrote:
> I'm working on making bufmgr.c ready for AIO writes.

Nice!

> == Problem 2 - AIO writes vs exclusive locks ==
>
> Separate from the hint bit issue, there is a second issue that I didn't have a
> good answer for: Making acquiring an exclusive lock concurrency safe in the
> presence of asynchronous writes:
>
> The problem is that while a buffer is being written out, it obviously has to
> be share locked. That's true even with AIO. With AIO the share lock is held
> until the IO is completed. The problem is that if a backend wants to
> exclusively lock a buffer undergoing AIO, it can't just wait for the content
> lock as today, it might have to actually reap the IO completion from the
> operating system. If one just were to wait for the content lock, there's no
> forward progress guarantee.
>
> The buffer's state "knows" that it's undergoing write IO (BM_VALID and
> BM_IO_IN_PROGRESS are set). To ensure forward progress guarantee, an exclusive
> locker needs to wait for the IO (pgaio_wref_wait(BufferDesc->io_wref)). The
> problem is that it's surprisingly hard to do so race free:
>
> If a backend A were to just check if a buffer is undergoing IO before locking
> it, a backend B could start IO on the buffer between A checking for
> BM_IO_IN_PROGRESS and acquiring the content lock. We obviously can't just
> hold the buffer header spinlock across a blocking lwlock acquisition.
>
> There potentially are ways to synchronize the buffer state and the content
> lock, but it requires deep integration between bufmgr.c and lwlock.c.

You may have considered and rejected simpler alternatives for (2) before
picking the approach you go on to outline. Anything interesting? For
example, I imagine these might work with varying degrees of inefficiency:

- Use LWLockConditionalAcquire() with some nonstandard waiting protocol when
there's a non-I/O lock conflict.
- Take BM_IO_IN_PROGRESS before exclusive-locking, then release it.
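
To make the first alternative concrete, here is a minimal toy sketch, not
bufmgr.c code. ToyBuffer and every toy_* function are hypothetical names, and
toy_wait_for_io() merely simulates reaping a completion. The point is that the
locker never blocks inside the lock itself; after each failed conditional
acquire it re-checks for IO, so IO started between the check and the acquire
attempt is still noticed:

```c
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for a buffer; all names here are hypothetical. */
typedef struct ToyBuffer
{
    atomic_int  lock_state;     /* 0 = free, -1 = exclusive, n > 0 = sharers */
    atomic_bool io_in_progress; /* stands in for BM_IO_IN_PROGRESS */
} ToyBuffer;

/* Analogue of a conditional acquire: succeed only if the lock is free. */
static bool
toy_conditional_acquire_exclusive(ToyBuffer *buf)
{
    int expected = 0;

    return atomic_compare_exchange_strong(&buf->lock_state, &expected, -1);
}

/*
 * Simulated completion reaping: mark the IO done and drop the share lock
 * that was held on behalf of the IO.
 */
static void
toy_wait_for_io(ToyBuffer *buf)
{
    if (atomic_exchange(&buf->io_in_progress, false))
        atomic_fetch_sub(&buf->lock_state, 1);
}

/* Exclusive lock that never sleeps on the lock while IO may be pending. */
static void
toy_lock_exclusive(ToyBuffer *buf)
{
    for (;;)
    {
        if (toy_conditional_acquire_exclusive(buf))
            return;
        if (atomic_load(&buf->io_in_progress))
            toy_wait_for_io(buf);   /* IO conflict: drive it to completion */
        else
            sched_yield();          /* placeholder for the non-I/O wait protocol */
    }
}
```

The sched_yield() branch is where the "nonstandard waiting protocol" would
have to go; a real version would need something better than spinning.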

> == Problem 3 - Cacheline contention ==

> c) Read accesses to the BufferDesc cause contention
>
> Some code, like nbtree, relies on functions like
> BufferGetBlockNumber(). Unfortunately that contends with concurrent
> modifications of the buffer descriptor (like pinning). Potential solutions
> are to rely less on functions like BufferGetBlockNumber() or to split out
> the memory for that into a separate (denser?) array.

Agreed. BufferGetBlockNumber() could even use a new local (non-shmem) data
structure, since the buffer's mapping can't change until we unpin.
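
A rough sketch of that idea, assuming a backend-local array indexed by buffer
ID. The toy_* names, TOY_NBUFFERS, and both structs are hypothetical and only
illustrate the shape; this is not existing bufmgr.c code:

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NBUFFERS 16

/* Stands in for the shared, contended buffer descriptor. */
typedef struct ToySharedDesc
{
    uint32_t blkno;
} ToySharedDesc;

/* Backend-local (non-shmem) entry, valid only while local_pins > 0. */
typedef struct ToyLocalEntry
{
    int      local_pins;
    uint32_t blkno;
} ToyLocalEntry;

static ToySharedDesc shared_descs[TOY_NBUFFERS];
static ToyLocalEntry local_cache[TOY_NBUFFERS];

/* On the first local pin, copy the block number out of the shared descriptor. */
static void
toy_pin(int buf)
{
    if (local_cache[buf].local_pins++ == 0)
        local_cache[buf].blkno = shared_descs[buf].blkno;
}

static void
toy_unpin(int buf)
{
    assert(local_cache[buf].local_pins > 0);
    local_cache[buf].local_pins--;
}

/* Reads only backend-local memory, so it never touches the shared cacheline. */
static uint32_t
toy_get_block_number(int buf)
{
    assert(local_cache[buf].local_pins > 0);
    return local_cache[buf].blkno;
}
```

The cached value stays correct for the same reason given above: the mapping
can't change while this backend holds a pin.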

> d) Even after addressing all of the above, there's still a lot of contention
>
> I think the solution here would be something roughly like fastpath locks. If
> a buffer is very contended, we can mark it as super-pinned & share locked,
> avoiding any atomic operation on the buffer descriptor itself. Instead the
> current lock and pincount would be stored in each backend's PGPROC.
> Obviously evicting or exclusively-locking such a buffer would be a lot more
> expensive.
>
> I've prototyped it and it helps a *lot*. The reason I mention this here is
> that this seems impossible to do while using the generic lwlocks for the
> content lock.

Nice.

On Tue, Aug 26, 2025 at 05:00:13PM -0400, Andres Freund wrote:
> On 2025-08-26 16:21:36 -0400, Robert Haas wrote:
> > On Fri, Aug 22, 2025 at 3:45 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > The order of changes I think makes the most sense is the following:

No concerns so far. I won't claim I can picture all the implications and be
sure this is the right thing, but it sounds promising. I like your principle
of ordering changes to avoid performance regressions.

> > > DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?

I would consider {AccessShare, Exclusive, AccessExclusive}. What the $SUBJECT
proposal calls SHARE-EXCLUSIVE would become Exclusive. That has the same
conflict matrix as the corresponding heavyweight locks, which seems good. I
don't love our mode names, particularly ShareRowExclusive being unsharable.
However, learning one special taxonomy is better than learning two.
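
For concreteness, here is the conflict matrix I have in mind, written as a
small C sketch. The enum and function names are hypothetical, not existing
PostgreSQL identifiers, and the mapping of today's share and exclusive
content locks onto AccessShare and AccessExclusive is my reading of the
proposal:

```c
#include <stdbool.h>

/* Hypothetical mode names, for illustration only. */
typedef enum
{
    MODE_ACCESS_SHARE = 0,      /* assumed: today's share lock */
    MODE_EXCLUSIVE = 1,         /* the proposal's SHARE-EXCLUSIVE */
    MODE_ACCESS_EXCLUSIVE = 2   /* assumed: today's exclusive lock */
} ContentLockMode;

/* Bit N set in conflict_mask[M] means mode M conflicts with mode N. */
static const unsigned conflict_mask[] = {
    [MODE_ACCESS_SHARE] = 1u << MODE_ACCESS_EXCLUSIVE,
    [MODE_EXCLUSIVE] = (1u << MODE_EXCLUSIVE) |
                       (1u << MODE_ACCESS_EXCLUSIVE),
    [MODE_ACCESS_EXCLUSIVE] = (1u << MODE_ACCESS_SHARE) |
                              (1u << MODE_EXCLUSIVE) |
                              (1u << MODE_ACCESS_EXCLUSIVE),
};

/* True if a lock held in mode "held" blocks a request for mode "wanted". */
static bool
modes_conflict(ContentLockMode held, ContentLockMode wanted)
{
    return (conflict_mask[held] & (1u << wanted)) != 0;
}
```

The middle mode is compatible with AccessShare but conflicts with itself,
which matches the heavyweight lock modes of the same names.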

> > AFAIK "share exclusive" or "SX" is standard terminology.

Can you say more about that? I looked around at
https://google.com/search?q=share+exclusive+%22sx%22+lock but didn't find
anything well-aligned with the proposal:

https://dev.mysql.com/doc/dev/mysql-server/latest//PAGE_LOCK_ORDER.html looked
most relevant, but it doesn't give the big idea.
https://mysqlonarm.github.io/Understanding-InnoDB-rwlock-stats/ is less
authoritative but does articulate the big idea, as "Shared-Exclusive (SX):
offer write access to the resource with inconsistent read. (relaxed
exclusive)." That differs from $SUBJECT semantics, in which SHARE-EXCLUSIVE
can't see inconsistent reads.

https://docs.oracle.com/en/database/oracle/oracle-database/19/arpls/DBMS_LOCK.html
has term SX = "sub exclusive". I gather an SX lock on a table lets one do
SELECT FOR UPDATE on that table (each row is the "sub"component being locked).

https://man.freebsd.org/cgi/man.cgi?query=sx_slock&sektion=9&format=html uses
the term "SX", but it's more like our lwlocks. One acquires S or X, not
blends of them.
