Re: Buffer locking is special (hints, checksums, AIO writes)

From: Andres Freund <andres(at)anarazel(dot)de>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alexander Lakhin <exclusion(at)gmail(dot)com>, Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>, Kirill Reshke <reshkekirill(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Subject: Re: Buffer locking is special (hints, checksums, AIO writes)
Date: 2026-01-29 19:29:27
Message-ID: cmjazttp6zz5gttyxfp3iakcaqxev33vanks4uhrwjyskdrzqz@er2mmhtobt62
Lists: pgsql-hackers

Hi,

On 2026-01-29 13:33:02 -0500, Peter Geoghegan wrote:
> On Thu, Jan 29, 2026 at 1:06 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > Wonder if - independent of this
> > issue - it could make sense to update the FSM during nbtree WAL recovery...
>
> Maybe that would make sense. But I tend to think that we should have a
> fully atomic, crash-safe approach to free space management.

I agree that would be nice, but realistically (as you also say below) that
would have to be embedded into the WAL records that use the page that was
acquired from the FSM. Maybe we could accept a dedicated WAL record for the
index case, but certainly not in the heap case.

Given that we'd need to embed the record somehow anyway, just adding, for now,
a RecordUsedIndexPage() to the redo of XLOG_BTREE_SPLIT* and
XLOG_BTREE_NEWROOT or such could make sense...

It doesn't seem like it'd be great to have a completely outdated index FSM
after a failover. If the index FSM on the newly promoted node is completely
outdated, due to having been copied at a much earlier time while there were a
lot of free pages, a _bt_allocbuf() could take quite a while...
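
To make the concern above concrete, here is a minimal, self-contained toy model
in C. It is only a sketch under stated assumptions, not PostgreSQL code: the
ToyIndex struct and every toy_* function are hypothetical names invented for
illustration, and the "FSM" is a plain array of flags. It counts how many FSM
candidates an allocation has to reject after promotion, first when redo leaves
the FSM copy untouched and then when the redo of a split also marks the page it
consumed as used (in the spirit of calling RecordUsedIndexPage() there).

```c
#include <assert.h>
#include <stdbool.h>

#define NBLOCKS 16

/* Toy stand-in for an index on a standby; not PostgreSQL's data structures. */
typedef struct ToyIndex
{
    bool fsm_free[NBLOCKS];     /* what the (possibly stale) FSM claims */
    bool page_in_use[NBLOCKS];  /* actual page state after redo */
} ToyIndex;

/* Hypothetical analogue of RecordUsedIndexPage(): drop the page from the FSM. */
static void
toy_record_used_index_page(ToyIndex *idx, unsigned blk)
{
    assert(blk < NBLOCKS);
    idx->fsm_free[blk] = false;
}

/* Toy redo of a split that consumed block newblk as the new sibling page. */
static void
toy_redo_split(ToyIndex *idx, unsigned newblk, bool update_fsm)
{
    assert(newblk < NBLOCKS);
    idx->page_in_use[newblk] = true;
    if (update_fsm)
        toy_record_used_index_page(idx, newblk);
}

/*
 * Toy allocation after promotion: take candidates from the FSM until one is
 * really free.  Returns the block number, or -1 if the FSM is exhausted (a
 * real system would then extend the relation).  *wasted counts the candidates
 * that had to be rejected because the page was already in use.
 */
static int
toy_allocbuf(ToyIndex *idx, int *wasted)
{
    *wasted = 0;
    for (unsigned blk = 0; blk < NBLOCKS; blk++)
    {
        if (!idx->fsm_free[blk])
            continue;
        idx->fsm_free[blk] = false; /* the candidate is consumed either way */
        if (!idx->page_in_use[blk])
        {
            idx->page_in_use[blk] = true;
            return (int) blk;
        }
        (*wasted)++;
    }
    return -1;
}

/*
 * The FSM was copied while blocks [0, nfree) were free; afterwards nsplits
 * splits are replayed, consuming blocks 0 .. nsplits-1.  Returns how many
 * stale candidates the first allocation after promotion has to reject.
 */
static int
toy_wasted_probes(unsigned nfree, unsigned nsplits, bool update_fsm)
{
    ToyIndex idx = {0};
    int wasted;

    assert(nsplits <= nfree && nfree <= NBLOCKS);
    for (unsigned blk = 0; blk < NBLOCKS; blk++)
    {
        idx.fsm_free[blk] = (blk < nfree);
        idx.page_in_use[blk] = (blk >= nfree);
    }
    for (unsigned blk = 0; blk < nsplits; blk++)
        toy_redo_split(&idx, blk, update_fsm);

    (void) toy_allocbuf(&idx, &wasted);
    return wasted;
}
```

In this toy, toy_wasted_probes(8, 6, false) has to reject six stale candidates
before it finds a free page, while toy_wasted_probes(8, 6, true) rejects none.
The model ignores locking, the cost of reading each candidate page, and how a
redo routine would actually reach the FSM, so it only illustrates the shape of
the problem, not its real cost.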

I'm somewhat surprised it doesn't cause more performance issues to keep btree
pages exclusively locked while extending the relation... If that has to write
out pages and flush the WAL...

> Particularly in index AMs, where free space can only ever come in
> BLCKSZ units -- the data structure/concurrency rules can be a lot
> simpler if it only has to accommodate index AM requirements. Maybe the
> WAL-logging could be built into existing index AM record types.

Yea, I have my doubts that it makes sense to share code between the index and
heap use cases. I doubt that having one FSM implementation support a variable
amount of "space tracking granularity" really makes sense.

Greetings,

Andres Freund
