Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Andres Freund <andres(at)anarazel(dot)de>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Kirill Reshke <reshkekirill(at)gmail(dot)com>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)
Date: 2025-09-02 23:54:07
Message-ID: tvvtfoxz5ykpsctxjbzxg3nldnzfc7geplrt2z2s54pmgto27y@hbijsndifu45
Lists: pgsql-hackers

Hi,

On 2025-09-02 19:11:01 -0400, Melanie Plageman wrote:
> From dd98177294011ee93cac122405516abd89f4e393 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
> Date: Wed, 27 Aug 2025 08:50:15 -0400
> Subject: [PATCH v8 01/22] Remove unneeded VM pin from VM replay

LGTM.

> From 7c5cb3edf89735eaa8bee9ca46111bd6c554720b Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
> Date: Wed, 27 Aug 2025 10:07:29 -0400
> Subject: [PATCH v8 02/22] Add assert and log message to visibilitymap_set

LGTM.

> From 07f31099754636ec9dabf6cca06c33c4b19c230c Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
> Date: Tue, 17 Jun 2025 17:22:10 -0400
> Subject: [PATCH v8 03/22] Eliminate xl_heap_visible in COPY FREEZE
>
> Instead of emitting a separate WAL record for setting the VM bits in
> xl_heap_visible, specify the changes to make to the VM block in the
> xl_heap_multi_insert record instead.
>
> Author: Melanie Plageman <melanieplageman(at)gmail(dot)com>
> Reviewed-by: Kirill Reshke <reshkekirill(at)gmail(dot)com>
> Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com

> +        /*
> +         * If we're only adding already frozen rows to a previously empty
> +         * page, mark it as all-frozen and update the visibility map. We're
> +         * already holding a pin on the vmbuffer.
> +         */
>          else if (all_frozen_set)
> +        {
> +            Assert(visibilitymap_pin_ok(BufferGetBlockNumber(buffer), vmbuffer));
>              PageSetAllVisible(page);
> +            LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);
> +            visibilitymap_set_vmbyte(relation,
> +                                     BufferGetBlockNumber(buffer),
> +                                     vmbuffer,
> +                                     VISIBILITYMAP_ALL_VISIBLE |
> +                                     VISIBILITYMAP_ALL_FROZEN);
> +        }

From an abstraction POV I don't love that heapam.c is now responsible for
acquiring and releasing the lock. But that ship has already more or less
sailed, as heapam.c is also responsible for releasing the vm buffer etc.

I've wondered about splitting the responsibilities up into multiple
visibilitymap_set_* functions, so that heapam.c wouldn't need to acquire the
lock and set the LSN. But it's probably not worth it.
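For the record, the kind of split I have in mind would look roughly like
this (made-up names, untested sketch; it assumes the caller emits WAL
between the two calls):

    /* Lock the VM page and set the bits; returns the old bits. */
    uint8
    visibilitymap_set_prepare(Relation rel, BlockNumber heapBlk,
                              Buffer vmBuf, uint8 flags)
    {
        LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
        return visibilitymap_set_vmbyte(rel, heapBlk, vmBuf, flags);
    }

    /* After the caller has emitted WAL: stamp the LSN and unlock. */
    void
    visibilitymap_set_finish(Buffer vmBuf, XLogRecPtr lsn)
    {
        PageSetLSN(BufferGetPage(vmBuf), lsn);
        LockBuffer(vmBuf, BUFFER_LOCK_UNLOCK);
    }

That way heapam.c would only deal with the pin, not with the lock or the
LSN.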

> +    /*
> +     * Now read and update the VM block. Even if we skipped updating the heap
> +     * page due to the file being dropped or truncated later in recovery, it's
> +     * still safe to update the visibility map. Any WAL record that clears
> +     * the visibility map bit does so before checking the page LSN, so any
> +     * bits that need to be cleared will still be cleared.
> +     *
> +     * It is only okay to set the VM bits without holding the heap page lock
> +     * because we can expect no other writers of this page.
> +     */
> +    if (xlrec->flags & XLH_INSERT_ALL_FROZEN_SET &&
> +        XLogReadBufferForRedoExtended(record, 1, RBM_ZERO_ON_ERROR, false,
> +                                      &vmbuffer) == BLK_NEEDS_REDO)
> +    {
> +        Relation    reln = CreateFakeRelcacheEntry(rlocator);
> +
> +        Assert(visibilitymap_pin_ok(blkno, vmbuffer));
> +        visibilitymap_set_vmbyte(reln, blkno,
> +                                 vmbuffer,
> +                                 VISIBILITYMAP_ALL_VISIBLE |
> +                                 VISIBILITYMAP_ALL_FROZEN);
> +
> +        /*
> +         * It is not possible that the VM was already set for this heap page,
> +         * so the vmbuffer must have been modified and marked dirty.
> +         */
> +        Assert(BufferIsDirty(vmbuffer));

How about making visibilitymap_set_vmbyte() return whether it needed to do
something? This seems somewhat indirect...

I think it might be good to encapsulate this code into a helper in
visibilitymap.c; there will be more callers in the subsequent patches.
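Roughly what I'm imagining (made-up name, untested sketch; it also folds in
the return-value idea from above):

    /*
     * Redo-side helper: set the passed-in bits and stamp the VM page's LSN
     * iff the page actually changed. Returns the previous bits.
     */
    uint8
    visibilitymap_set_redo(Relation rel, BlockNumber heapBlk, Buffer vmBuf,
                           XLogRecPtr lsn, uint8 flags)
    {
        uint8       old_vmbits;

        Assert(visibilitymap_pin_ok(heapBlk, vmBuf));
        old_vmbits = visibilitymap_set_vmbyte(rel, heapBlk, vmBuf, flags);
        if (old_vmbits != flags)
            PageSetLSN(BufferGetPage(vmBuf), lsn);
        return old_vmbits;
    }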

> +/*
> + * Set flags in the VM block contained in the passed in vmBuf.
> + *
> + * This function is for callers which include the VM changes in the same WAL
> + * record as the modifications of the heap page which rendered it all-visible.
> + * Callers separately logging the VM changes should invoke visibilitymap_set()
> + * instead.
> + *
> + * Caller must have pinned and exclusive locked the correct block of the VM in
> + * vmBuf. This block should contain the VM bits for the given heapBlk.
> + *
> + * During normal operation (i.e. not recovery), this should be called in a
> + * critical section which also makes any necessary changes to the heap page
> + * and, if relevant, emits WAL.
> + *
> + * Caller is responsible for WAL logging the changes to the VM buffer and for
> + * making any changes needed to the associated heap page. This includes
> + * maintaining any invariants such as ensuring the buffer containing heapBlk
> + * is pinned and exclusive locked.
> + */
> +uint8
> +visibilitymap_set_vmbyte(Relation rel, BlockNumber heapBlk,
> +                         Buffer vmBuf, uint8 flags)

Why is it named vmbyte? This actually just sets the two bits corresponding to
the heap block, not the entire byte. So it seems somewhat misleading to
reference "byte" in the name.
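
For reference, internally this boils down to something like the following
(paraphrasing the existing macros in visibilitymap.c; two bits per heap
block, four heap blocks per VM byte):

    uint32      mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
    uint8       mapOffset = HEAPBLK_TO_OFFSET(heapBlk);

    /* only the two bits belonging to heapBlk are or'd in */
    map[mapByte] |= (flags << mapOffset);

So perhaps something like visibilitymap_set_vmbits() would fit better.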

> From dc318358572f61efbd0e05aae2b9a077b422bcf5 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
> Date: Wed, 18 Jun 2025 12:42:13 -0400
> Subject: [PATCH v8 05/22] Eliminate xl_heap_visible from vacuum phase III
>
> Instead of emitting a separate xl_heap_visible record for each page that
> is rendered all-visible by vacuum's third phase, include the updates to
> the VM in the already emitted xl_heap_prune record.

Reading through the change I didn't particularly like that there's another
optional field in xl_heap_prune, as it seemed like something that should be
encoded in flags. Of course there aren't enough flag bits available. But
that made me look at the rest of the record: Uh, what do we use the reason
field for? As far as I can tell f83d709760d8 added it without introducing any
users? It doesn't even seem to be set.

> @@ -51,10 +52,15 @@ heap_xlog_prune_freeze(XLogReaderState *record)
>             (xlrec.flags & (XLHP_HAS_REDIRECTIONS | XLHP_HAS_DEAD_ITEMS)) == 0);
>
>     /*
> -     * We are about to remove and/or freeze tuples. In Hot Standby mode,
> -     * ensure that there are no queries running for which the removed tuples
> -     * are still visible or which still consider the frozen xids as running.
> -     * The conflict horizon XID comes after xl_heap_prune.
> +     * After xl_heap_prune is the optional snapshot conflict horizon.
> +     *
> +     * In Hot Standby mode, we must ensure that there are no running queries
> +     * which would conflict with the changes in this record. If pruning, that
> +     * means we cannot remove tuples still visible to transactions on the
> +     * standby. If freezing, that means we cannot freeze tuples with xids that
> +     * are still considered running on the standby. And for setting the VM, we
> +     * cannot do so if the page isn't all-visible to all transactions on the
> +     * standby.
>      */

I'm a bit confused by this new comment - it sounds like we're deciding whether
to remove tuple versions, but that decision has long been made, no?

> @@ -2846,8 +2848,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
>     OffsetNumber unused[MaxHeapTuplesPerPage];
>     int         nunused = 0;
>     TransactionId visibility_cutoff_xid;
> +   TransactionId conflict_xid = InvalidTransactionId;
>     bool        all_frozen;
>     LVSavedErrInfo saved_err_info;
> +   uint8       vmflags = 0;
> +   bool        set_pd_all_vis = false;
>
>     Assert(vacrel->do_index_vacuuming);
>
> @@ -2858,6 +2863,20 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
>                                    VACUUM_ERRCB_PHASE_VACUUM_HEAP, blkno,
>                                    InvalidOffsetNumber);
>
> +   if (heap_page_is_all_visible_except_lpdead(vacrel->rel, buffer,
> +                                              vacrel->cutoffs.OldestXmin,
> +                                              deadoffsets, num_offsets,
> +                                              &all_frozen, &visibility_cutoff_xid,
> +                                              &vacrel->offnum))
> +   {
> +       vmflags |= VISIBILITYMAP_ALL_VISIBLE;
> +       if (all_frozen)
> +       {
> +           vmflags |= VISIBILITYMAP_ALL_FROZEN;
> +           Assert(!TransactionIdIsValid(visibility_cutoff_xid));
> +       }
> +   }
> +
>     START_CRIT_SECTION();

I am rather confused - we never can set all-visible if there are any LP_DEAD
items left. If the idea is that we are removing the LP_DEAD items in
lazy_vacuum_heap_page() - what guarantees that all LP_DEAD items are being
removed? Couldn't some tuples get marked LP_DEAD by on-access pruning, after
vacuum visited the page and collected dead items?

Ugh, I see - it works because we pass in the set of dead items. I think that
makes the name *really* misleading: it's not "except LP_DEAD", it's "except
the offsets passed in", no?

But then you actually check that the set of dead items didn't change - what
guarantees that?

I didn't look at the later patches, except that I did notice this:

> @@ -268,7 +264,7 @@ heap_xlog_prune_freeze(XLogReaderState *record)
>         Relation    reln = CreateFakeRelcacheEntry(rlocator);
>
>         visibilitymap_pin(reln, blkno, &vmbuffer);
> -       old_vmbits = visibilitymap_set_vmbyte(reln, blkno, vmbuffer, vmflags);
> +       old_vmbits = visibilitymap_set(reln, blkno, vmbuffer, vmflags);
>         /* Only set VM page LSN if we modified the page */
>         if (old_vmbits != vmflags)
>             PageSetLSN(BufferGetPage(vmbuffer), lsn);
> @@ -279,143 +275,6 @@ heap_xlog_prune_freeze(XLogReaderState *record)
>         UnlockReleaseBuffer(vmbuffer);
>     }

Why are we manually pinning the vm buffer here? Shouldn't the xlog machinery
have done so, as you noticed in one of the earlier patches?

Greetings,

Andres Freund
