Re: Buffer locking is special (hints, checksums, AIO writes)

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Greg Burd <greg(at)burd(dot)me>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Noah Misch <noah(at)leadboat(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Subject: Re: Buffer locking is special (hints, checksums, AIO writes)
Date: 2025-11-25 00:09:38
Message-ID: CA+hUKGLmpStLUW3LVzPiR_-zJ8=QrMoBT82z7HnLzk9nMU=KGg@mail.gmail.com
Lists: pgsql-hackers

On Fri, Nov 21, 2025 at 9:51 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> It's worth pointing out that the new way of setting hint bits is inherently
> more expensive than what we did before - upgrading a lock to a different lock
> level isn't free, compared to doing, well, nothing.
>
> For paths that set the hint bits of a whole page, like a seqscan, that cost is
> more than amortized by the batched approach introduced in 0011. Those get
> faster with the patch, both when already hinted and when not.

Nice work!

> However, there are paths that aren't easily amenable to that approach, like
> e.g. an ordered index scan referencing unhinted tuples. There we only ever
> access a single tuple and release the upgraded lock after every tuple. If the
> index scan is perfectly correlated with the table and every tuple is unhinted,
> that's a decent amount of additional work.

Yeah, but it was only faster because it was cheating. It presumably
doesn't happen when you bulk load and then create the index. It
presumably does happen when you insert a lot of data in order, on the
first correlated index scan. Seems like an inherent limitation of the
current tuple-at-a-time architecture when combined with the *required*
interlocking, and not a blocker for this work.
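
To make the amortisation argument concrete, here is a toy cost model, not
PostgreSQL code. It only counts simulated lock upgrades for the two access
patterns discussed above. All names (HintStats, hint_per_tuple,
hint_per_page) are hypothetical and exist only for this illustration, and
it assumes every tuple on every page is unhinted.

```c
#include <assert.h>

/* Toy model only: none of these names exist in PostgreSQL. */
typedef struct
{
    long upgrades;      /* simulated lock upgrades performed */
    long hints_set;     /* simulated hint bits set */
} HintStats;

/*
 * Tuple-at-a-time access (e.g. a correlated index scan): every unhinted
 * tuple pays for its own upgrade and release.
 */
static HintStats
hint_per_tuple(long npages, long tuples_per_page)
{
    HintStats s = {0, 0};

    assert(npages >= 0 && tuples_per_page >= 0);
    for (long p = 0; p < npages; p++)
    {
        for (long t = 0; t < tuples_per_page; t++)
        {
            s.upgrades++;       /* upgrade, set one hint, release */
            s.hints_set++;
        }
    }
    return s;
}

/*
 * Page-at-a-time access (e.g. a seqscan with batched hinting): a single
 * upgrade covers every tuple on the page.
 */
static HintStats
hint_per_page(long npages, long tuples_per_page)
{
    HintStats s = {0, 0};

    assert(npages >= 0 && tuples_per_page >= 0);
    for (long p = 0; p < npages; p++)
    {
        s.upgrades++;                   /* one upgrade for the whole page */
        s.hints_set += tuples_per_page;
    }
    return s;
}
```

With 1000 pages of 200 tuples each, hint_per_tuple() counts 200000 upgrades
and hint_per_page() counts 1000, for the same 200000 hint bits. The model
says nothing about the actual cost of an upgrade, only how often it is paid.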

+ Some filesystems, raid implementations, ... do not tolerate the data being

I was aware of BTRFS (EIO on read) and ZFS 2.4 (EIO on read or write
depending on configuration option), but hadn't thought about RAID.
Ugh, right, non-matching RAID1 mirrors (and I guess also b0rked RAID5
parity bits?). Fun.

https://bugzilla.kernel.org/show_bug.cgi?id=99171
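
A minimal sketch of why data changing during a write upsets checksumming
storage, under loud assumptions: FNV-1a stands in for whatever checksum the
filesystem really uses, the single bit flip stands in for a concurrent
hint-bit update, and every name here (toy_checksum, toy_write_is_consistent,
TOY_PAGE_SIZE) is hypothetical. It is not how BTRFS, ZFS or any RAID layer
is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 8192

/* FNV-1a, used here only as a stand-in for a storage-layer checksum. */
static uint32_t
toy_checksum(const unsigned char *buf, size_t len)
{
    uint32_t h = 2166136261u;

    for (size_t i = 0; i < len; i++)
    {
        h ^= buf[i];
        h *= 16777619u;
    }
    return h;
}

/*
 * Simulate a write: the storage layer checksums the caller's buffer and
 * then copies it out.  If flip_bit_mid_write is nonzero, one bit of the
 * buffer changes between those two steps.  Returns 1 if the copied data
 * still matches the checksum computed earlier, 0 otherwise.
 */
static int
toy_write_is_consistent(unsigned char *page, int flip_bit_mid_write)
{
    unsigned char on_disk[TOY_PAGE_SIZE];
    uint32_t sum = toy_checksum(page, TOY_PAGE_SIZE);

    assert(page != NULL);
    if (flip_bit_mid_write)
        page[100] ^= 0x01;      /* stand-in for a concurrent hint-bit set */

    memcpy(on_disk, page, TOY_PAGE_SIZE);
    return toy_checksum(on_disk, TOY_PAGE_SIZE) == sum;
}
```

On a zeroed page, toy_write_is_consistent(page, 0) returns 1 and
toy_write_is_consistent(page, 1) returns 0: the stored copy no longer
matches its checksum, which is the kind of mismatch a later read would
report as an error.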

> I've spent a lot of time micro-optimizing that workload, to avoid any
> significant regressions. An extreme stress-test started out being about 20%
> slower than today, as of my current local version, it's a bit faster (~1%) on
> one of my machines and a bit slower (~2%) on another. Partially that was
> achieved by optimizing the hint-bit-lock-upgrade code more (e.g. having a fast
> path for updating a single hint bit, avoiding redundant reads of the lock
> state by having MarkSharedBufferDirtyHint(), ...), partially by optimizing the
> locking code. The latter is a bit of a cheat though - things would be even
> faster if we went with the old way of setting hint bits, but with the
> independent optimizations applied.
>
> I think that's ok though:
>
> 1) the old way of setting hint bits is a pretty dirty hack that causes issues
> in quite a few places.
>
> 2) by definition, having to set hint bits is an ephemeral state, once the hint
> bits are set, the difference vanishes
>
> 3) no normal workload shows the difference - my stress test does
> SELECT * FROM manyrows_idx ORDER BY i OFFSET 10000000;
> on a perfectly correlated table with very narrow rows, i.e. an index scan
> of the whole table, where none of the scan results are ever used. Once one
> actually uses the resulting rows, the performance difference completely
> vanishes.
>
> 4) as part of the index prefetching work, we might get the infrastructure to
> actually batch the hint-bit setting in this case too.

Yeah. Was just thinking the same. Both the streaming and batching
projects have opportunities to figure out an amortisation scheme. I
have a few vague ideas about stream-based approaches already, hmm...

+1, I think this is OK for now.
