| From: | Andres Freund <andres(at)anarazel(dot)de> |
|---|---|
| To: | Melanie Plageman <melanieplageman(at)gmail(dot)com> |
| Cc: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Kirill Reshke <reshkekirill(at)gmail(dot)com>, Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>, Xuneng Zhou <xunengzhou(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
| Subject: | Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access) |
| Date: | 2026-01-24 00:28:46 |
| Message-ID: | 7ib3sa55sapwjlaz4sijbiq7iezna27kjvvvar4dpgkmadml6t@gfpkkwmdnepx |
| Lists: | pgsql-hackers |
Hi,
On 2026-01-06 12:31:57 -0500, Melanie Plageman wrote:
> Subject: [PATCH v33 01/16] Combine visibilitymap_set() cases in
> lazy_scan_prune()
>
> lazy_scan_prune() previously had two separate cases that called
> visibilitymap_set() after pruning and freezing. These branches were
> nearly identical except that one attempted to avoid dirtying the heap
> buffer. However, that situation can never occur — the heap buffer cannot
> be clean at that point (and we would hit an assertion if it were).
>
> In lazy_scan_prune(), when we change a previously all-visible page to
> all-frozen and the page was recorded as all-visible in the visibility
> map by find_next_unskippable_block(), the heap buffer will always be
> dirty. Either we have just frozen a tuple and already dirtied the
> buffer, or the buffer was modified between find_next_unskippable_block()
> and heap_page_prune_and_freeze() and then pruned in
> heap_page_prune_and_freeze().
>
> Additionally, XLogRegisterBuffer() asserts that the buffer is dirty, so
> attempting to add a clean heap buffer to the WAL chain would assert out
> anyway.
>
> Since the “clean heap buffer with already set VM” case is impossible,
> the two visibilitymap_set() branches in lazy_scan_prune() can be merged.
> Doing so makes the intent clearer and emphasizes that the heap buffer
> must always be marked dirty before being added to the WAL chain.
>
> This commit also adds a test case for vacuuming when no heap
> modifications are required. Currently this ensures that the heap buffer
> is marked dirty before it is added to the WAL chain, but if we later
> remove the heap buffer from the VM-set WAL chain or pass it with the
> REGBUF_NO_CHANGES flag, this test would guard that behavior.
>
> Author: Melanie Plageman <melanieplageman(at)gmail(dot)com>
> Reviewed-by: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>
> Reviewed-by: Srinath Reddy Sadipiralla <srinath2133(at)gmail(dot)com>
> Reviewed-by: Kirill Reshke <reshkekirill(at)gmail(dot)com>
> Reviewed-by: Xuneng Zhou <xunengzhou(at)gmail(dot)com>
> Discussion: https://postgr.es/m/5CEAA162-67B1-44DA-B60D-8B65717E8B05%40gmail.com
> Discussion: https://postgr.es/m/flat/CAAKRu_ZWx5gCbeCf7PWCv8p5%3D%3Db7EEws0VD2wksDxpXCvCyHvQ%40mail.gmail.com
> ---
> .../pg_visibility/expected/pg_visibility.out | 44 ++++++++++
> contrib/pg_visibility/sql/pg_visibility.sql | 20 +++++
> src/backend/access/heap/vacuumlazy.c | 87 ++++---------------
> 3 files changed, 82 insertions(+), 69 deletions(-)
>
> diff --git a/contrib/pg_visibility/expected/pg_visibility.out b/contrib/pg_visibility/expected/pg_visibility.out
> index 09fa5933a35..e10f1706015 100644
> --- a/contrib/pg_visibility/expected/pg_visibility.out
> +++ b/contrib/pg_visibility/expected/pg_visibility.out
> @@ -1,4 +1,5 @@
> CREATE EXTENSION pg_visibility;
> +CREATE EXTENSION pageinspect;
I think this would need an EXTRA_INSTALL = contrib/pageinspect to work reliably
in the make build. You should be able to see a failure without the fix if you remove
the tmp_install/ dir in an autoconf build and then run make check for
pg_visibility.
I'm slightly wary of embedding numerical bitmask values in the tests, but I
don't see a better alternative right now.
Other than the EXTRA_INSTALL thing I think this is ready.
> From 4d37243f9fa0dc4e264a28bcee448787fb8d7f65 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
> Date: Thu, 11 Dec 2025 10:48:13 -0500
> Subject: [PATCH v33 02/16] Eliminate use of cached VM value in
> lazy_scan_prune()
>
> lazy_scan_prune() takes a parameter from lazy_scan_heap() indicating
> whether the page was marked all-visible in the VM at the time it was
> last checked in find_next_unskippable_block(). This behavior is
> historical, dating back to commit 608195a3a365, when we did not pin the
> VM page until deciding we must read it. Now that the VM page is already
> pinned, there is no meaningful benefit to relying on a cached VM status.
>
> Removing this cached value simplifies the logic in both lazy_scan_heap()
> and lazy_scan_prune(). It also clarifies future work that will set the
> visibility map on-access: such paths will not have a cached value
> available, which would make the logic harder to reason about. And
> eliminating it enables us to detect and repair VM corruption on-access.
>
> Along with removing the cached value and unconditionally checking the
> visibility status of the heap page, this commit also moves the VM
> corruption handling to occur first. This reordering should have no
> performance impact, since the checks are inexpensive and performed only
> once per page. It does, however, make the control flow easier to
> understand. The new restructuring also makes it possible to set the VM
> after fixing corruption (if pruning found the page all-visible).
>
> Now that no callers of visibilitymap_set() use its return value, change
> its (and visibilitymap_set_vmbits()) return type to void.
> @@ -1735,7 +1719,6 @@ find_next_unskippable_block(LVRelState *vacrel, bool *skipsallvis)
> BlockNumber next_unskippable_block = vacrel->next_unskippable_block + 1;
> Buffer next_unskippable_vmbuffer = vacrel->next_unskippable_vmbuffer;
> bool next_unskippable_eager_scanned = false;
> - bool next_unskippable_allvis;
>
> *skipsallvis = false;
>
> @@ -1745,7 +1728,6 @@ find_next_unskippable_block(LVRelState *vacrel, bool *skipsallvis)
> next_unskippable_block,
> &next_unskippable_vmbuffer);
>
> - next_unskippable_allvis = (mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0;
>
> /*
> * At the start of each eager scan region, normal vacuums with eager
> @@ -1764,7 +1746,7 @@ find_next_unskippable_block(LVRelState *vacrel, bool *skipsallvis)
> * A block is unskippable if it is not all visible according to the
> * visibility map.
> */
> - if (!next_unskippable_allvis)
> + if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
> {
> Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
> break;
This feels a bit independent of the rest, but it doesn't matter.
> @@ -2225,6 +2144,71 @@ lazy_scan_prune(LVRelState *vacrel,
> MarkBufferDirty(buf);
> visibilitymap_clear(vacrel->rel, blkno, vmbuffer,
> VISIBILITYMAP_VALID_BITS);
> + /* VM bits are now clear */
> + old_vmbits = 0;
> + }
> +
> + if (!presult.all_visible)
> + return presult.ndeleted;
> +
> + /* Set the visibility map and page visibility hint */
> + new_vmbits = VISIBILITYMAP_ALL_VISIBLE;
> +
> + if (presult.all_frozen)
> + new_vmbits |= VISIBILITYMAP_ALL_FROZEN;
> +
> + /* Nothing to do */
> + if (old_vmbits == new_vmbits)
> + return presult.ndeleted;
>
> + Assert(presult.all_visible);
Given that there's an explicit return for this case a few lines above, I don't
understand what the assert is trying to do?
> + /*
> + * It should never be the case that the visibility map page is set while
> + * the page-level bit is clear
I'd perhaps add a parenthetical saying (and if so, we cleared it above) or
such.
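Something along these lines, perhaps (wording is just a sketch):

    /*
     * It should never be the case that the VM bit is set while the
     * page-level bit is clear (and if it somehow were, we cleared
     * the VM above).
     */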
> + visibilitymap_set(vacrel->rel, blkno, buf,
> + InvalidXLogRecPtr,
> + vmbuffer, presult.vm_conflict_horizon,
> + new_vmbits);
> +
> + /*
> + * If the page wasn't already set all-visible and/or all-frozen in the VM,
> + * count it as newly set for logging.
> + */
> + if ((old_vmbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
> + {
> + vacrel->vm_new_visible_pages++;
> + if (presult.all_frozen)
> + {
> + vacrel->vm_new_visible_frozen_pages++;
> + *vm_page_frozen = true;
Not this patch's fault, but I find "vm_new_visible_pages" and
"vm_new_visible_frozen_pages" pretty odd names. The concept is all-visible and
frozen. The page itself isn't visible or invisible...
> Subject: [PATCH v33 03/16] Refactor lazy_scan_prune() VM clear logic into
> helper
> +/*
> + * Helper to correct any corruption detected on a heap page and its
> + * corresponding visibility map page after pruning but before setting the
> + * visibility map. It examines the heap page, the associated VM page, and the
> + * number of dead items previously identified.
> + *
> + * This function must be called while holding an exclusive lock on the heap
> + * buffer, and the dead items must have been discovered under that same lock.
> +
> + * The provided vmbits must reflect the current state of the VM block
> + * referenced by vmbuffer. Although we do not hold a lock on the VM buffer, it
> + * is pinned, and the heap buffer is exclusively locked, ensuring that no
> + * other backend can update the VM bits corresponding to this heap page.
> + *
> + * Returns true if it cleared corruption and false otherwise.
> + */
I don't love that the caller now has to assume that old_vmbits is zero after
this function returns. Somehow that feels like split responsibility.
I guess I'd take a pointer to old_vmbits and update it accordingly inside
identify_and_fix_vm_corruption() rather than in the caller.
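I.e. something like (untested sketch):

    static bool
    identify_and_fix_vm_corruption(Relation rel, Buffer heap_buffer,
                                   BlockNumber heap_blk, Page heap_page,
                                   int nlpdead_items, Buffer vmbuffer,
                                   uint8 *vmbits);

with *vmbits reset to 0 inside the function whenever it clears corrupt VM
bits, so the caller doesn't have to know about that invariant.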
> From 5c65e73246b4968ddfa9d3739f53d0d8734b8727 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
> Date: Tue, 2 Dec 2025 15:07:42 -0500
> Subject: [PATCH v33 04/16] Set the VM in heap_page_prune_and_freeze()
>
> This has no independent benefit. It is meant for ease of review. As of
> this commit, there is still a separate WAL record emitted for setting
> the VM after pruning and freezing. But it is easier to review if moving
> the logic into pruneheap.c is separate from setting the VM in the same
> WAL record.
It seems a bit noisy to refactor the related code in some of the preceding
commits and then refactor it into a slightly different shape as part of this
commit (cf. heap_page_will_set_vm()).
It's also a bit odd that a function that sounds rather read-only does stuff
like clearing VM/all-visible.
Why are we not fixing up the page *before* we prune it? It's a bit insane
that we do the WAL logging for pruning, which in turn will often include an
FPI, before we do the fixups. The fixes aren't WAL-logged, so this actually
leads to the standby getting further out of sync.
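I.e. roughly this order (hypothetical, just to illustrate what I mean):

    /* repair page-level / VM inconsistencies first ... */
    identify_and_fix_vm_corruption(...);
    /* ... so the prune record's FPI already reflects the repaired page */
    heap_page_prune_and_freeze(...);

instead of doing the un-WAL-logged fixups after the prune record (and its
FPI) have already been emitted.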
I realize this isn't your mess, but brrr.
Do we actually foresee a case where only one of HEAP_PAGE_PRUNE_FREEZE |
HEAP_PAGE_PRUNE_UPDATE_VM would be set?
> --- a/src/backend/access/heap/vacuumlazy.c
> +++ b/src/backend/access/heap/vacuumlazy.c
> @@ -424,11 +424,7 @@ static void find_next_unskippable_block(LVRelState *vacrel, bool *skipsallvis);
> static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
> BlockNumber blkno, Page page,
> bool sharelock, Buffer vmbuffer);
> -static bool identify_and_fix_vm_corruption(Relation rel, Buffer heap_buffer,
> - BlockNumber heap_blk, Page heap_page,
> - int nlpdead_items,
> - Buffer vmbuffer,
> - uint8 vmbits);
> +
> static int lazy_scan_prune(LVRelState *vacrel, Buffer buf,
> BlockNumber blkno, Page page,
> Buffer vmbuffer,
> @@ -1962,83 +1958,6 @@ cmpOffsetNumbers(const void *a, const void *b)
> return pg_cmp_u16(*(const OffsetNumber *) a, *(const OffsetNumber *) b);
> }
Spurious newline inserted.
> /*
> * If the page wasn't already set all-visible and/or all-frozen in the VM,
> * count it as newly set for logging.
> */
> - if ((old_vmbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
> + if ((presult.old_vmbits & VISIBILITYMAP_ALL_VISIBLE) == 0 &&
> + (presult.new_vmbits & VISIBILITYMAP_ALL_VISIBLE) != 0)
> {
> vacrel->vm_new_visible_pages++;
> - if (presult.all_frozen)
> + if ((presult.new_vmbits & VISIBILITYMAP_ALL_FROZEN) != 0)
> {
> vacrel->vm_new_visible_frozen_pages++;
> *vm_page_frozen = true;
> }
> }
> - else if ((old_vmbits & VISIBILITYMAP_ALL_FROZEN) == 0 &&
> - presult.all_frozen)
> + else if ((presult.old_vmbits & VISIBILITYMAP_ALL_FROZEN) == 0 &&
> + (presult.new_vmbits & VISIBILITYMAP_ALL_FROZEN) != 0)
> {
> + Assert((presult.new_vmbits & VISIBILITYMAP_ALL_VISIBLE) != 0);
> vacrel->vm_new_frozen_pages++;
> *vm_page_frozen = true;
> }
It's a bit odd that we figure out all of this by inspecting old/new vmbits and
have that logic in multiple places.
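E.g. a small helper could keep the counter logic in one place (untested
sketch, name made up):

    static void
    account_vm_newly_set(LVRelState *vacrel, uint8 old_vmbits,
                         uint8 new_vmbits, bool *vm_page_frozen)
    {
        if ((old_vmbits & VISIBILITYMAP_ALL_VISIBLE) == 0 &&
            (new_vmbits & VISIBILITYMAP_ALL_VISIBLE) != 0)
        {
            vacrel->vm_new_visible_pages++;
            if ((new_vmbits & VISIBILITYMAP_ALL_FROZEN) != 0)
            {
                vacrel->vm_new_visible_frozen_pages++;
                *vm_page_frozen = true;
            }
        }
        else if ((old_vmbits & VISIBILITYMAP_ALL_FROZEN) == 0 &&
                 (new_vmbits & VISIBILITYMAP_ALL_FROZEN) != 0)
        {
            vacrel->vm_new_frozen_pages++;
            *vm_page_frozen = true;
        }
    }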
> Subject: [PATCH v33 05/16] Move VM assert into prune/freeze code
Feels like a somewhat too narrow description, given that it changes the API
for heap_page_prune_and_freeze() by removing variables from PruneFreezeResult.
> +#ifdef USE_ASSERT_CHECKING
> +
> +/*
> + * Wrapper for heap_page_would_be_all_visible() which can be used for callers
> + * that expect no LP_DEAD on the page. Currently assert-only, but there is no
> + * reason not to use it outside of asserts.
> + */
If so, why would we want it in pruneheap.c? Seems a bit odd to have
heap_page_would_be_all_visible() defined in vacuumlazy.c but
heap_page_is_all_visible() defined in pruneheap.c.
> From cdf5776fadeae3430c692999b37f8a7ec944bda1 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
> Date: Tue, 2 Dec 2025 16:16:22 -0500
> Subject: [PATCH v33 06/16] Eliminate XLOG_HEAP2_VISIBLE from vacuum phase I
> prune/freeze
>
> Vacuum no longer emits a separate WAL record for each page set
> all-visible or all-frozen during phase I. Instead, visibility map
> updates are now included in the XLOG_HEAP2_PRUNE_VACUUM_SCAN record that
> is already emitted for pruning and freezing.
>
> Previously, heap_page_prune_and_freeze() determined whether a page was
> all-visible, but the corresponding VM bits were only set later in
> lazy_scan_prune(). Now the VM is updated immediately in
> heap_page_prune_and_freeze(), at the same time as the heap
> modifications.
>
> This change applies only to vacuum phase I, not to pruning performed
> during normal page access.
> +/*
> + * Calculate the conflict horizon for the whole XLOG_HEAP2_PRUNE_VACUUM_SCAN
> + * or XLOG_HEAP2_PRUNE_ON_ACCESS record.
> + */
> +static TransactionId
> +get_conflict_xid(bool do_prune, bool do_freeze, bool do_set_vm,
> + uint8 old_vmbits, uint8 new_vmbits,
> + TransactionId latest_xid_removed, TransactionId frz_conflict_horizon,
> + TransactionId visibility_cutoff_xid)
> +{
The logic for horizons is now split between this and "Calculate what the
snapshot conflict horizon should be for a record" in heap_page_will_freeze().
Although I guess I don't understand that code:
    /*
     * Calculate what the snapshot conflict horizon should be for a record
     * freezing tuples. We can use the visibility_cutoff_xid as our cutoff
     * for conflicts when the whole page is eligible to become all-frozen
     * in the VM once we're done with it. Otherwise, we generate a
     * conservative cutoff by stepping back from OldestXmin.
     */
    if (prstate->all_frozen)
        prstate->frz_conflict_horizon = prstate->visibility_cutoff_xid;
    else
    {
        /* Avoids false conflicts when hot_standby_feedback in use */
        prstate->frz_conflict_horizon = prstate->cutoffs->OldestXmin;
        TransactionIdRetreat(prstate->frz_conflict_horizon);
    }
Why does it make sense to use OldestXmin? Consider e.g. the case where there
is one very old tuple that needs to be frozen and one new live tuple on a
page. Because of the new tuple we can't mark the page all-frozen. But there's
also no reason not to use a much less aggressive horizon than OldestXmin,
namely the newer of xmin/xmax of the old tuple being frozen?
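I.e. roughly, while deciding to freeze a tuple (sketch, macro usage from
memory; assumes frz_conflict_horizon starts out at InvalidTransactionId):

    TransactionId xmin = HeapTupleHeaderGetXmin(htup);
    TransactionId xmax = HeapTupleHeaderGetUpdateXid(htup);

    if (TransactionIdIsNormal(xmin) &&
        TransactionIdFollows(xmin, prstate->frz_conflict_horizon))
        prstate->frz_conflict_horizon = xmin;
    if (TransactionIdIsNormal(xmax) &&
        TransactionIdFollows(xmax, prstate->frz_conflict_horizon))
        prstate->frz_conflict_horizon = xmax;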
I also don't understand what the "false conflicts" thing is referencing.
> + TransactionId conflict_xid;
> +
> + /*
> + * We can omit the snapshot conflict horizon if we are not pruning or
> + * freezing any tuples and are setting an already all-visible page
> + * all-frozen in the VM. In this case, all of the tuples on the page must
> + * already be visible to all MVCC snapshots on the standby.
> + */
The last sentence here is a bit confusing, because the tuples don't just need
to already be visible to everyone, they also need to already be frozen. Right?
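Maybe something like (sketch): "... all of the tuples on the page must
already be frozen, and therefore visible to all MVCC snapshots on the
standby."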
> + if (!do_prune &&
> + !do_freeze &&
> + do_set_vm &&
I'm confused by the do_set_vm check here. Doesn't it mean that we will *not*
return InvalidTransactionId if !prstate->attempt_update_vm? I don't understand
why that would make sense.
I guess we'll compute a bogus cutoff in that case, but never use it, since
we'll also not emit WAL? Or maybe we'll just unnecessarily go through the
code below, because the code turns out to do ok regardless? It's confusing
either way.
> + (old_vmbits & VISIBILITYMAP_ALL_VISIBLE) != 0 &&
> + (new_vmbits & VISIBILITYMAP_ALL_FROZEN) != 0)
I wonder if some of this code would end up cleaner if we tracked the bits we
intend to add, rather than the full target set of bits.
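E.g. (sketch):

    uint8 vmbits_to_set = new_vmbits & ~old_vmbits;

after which "did we newly set ALL_FROZEN" is just
(vmbits_to_set & VISIBILITYMAP_ALL_FROZEN) != 0, and "nothing to do" is
vmbits_to_set == 0.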
> + /*
> + * The snapshotConflictHorizon for the whole record should be the most
> + * conservative of all the horizons calculated for any of the possible
> + * modifications. If this record will prune tuples, any transactions on
> + * the standby older than the youngest xmax of the most recently removed
> + * tuple this record will prune will conflict. If this record will freeze
> + * tuples, any transactions on the standby with xids older than the
> + * youngest tuple this record will freeze will conflict.
> + */
> + conflict_xid = InvalidTransactionId;
I'd move this first assignment into an else.
> + /*
> + * If we are updating the VM, the conflict horizon is almost always the
> + * visibility cutoff XID.
> + *
> + * Separately, if we are freezing any tuples, as an optimization, we can
> + * use the visibility_cutoff_xid as the conflict horizon if the page will
> + * be all-frozen.
What does "as an optimization" mean here?
Note that the code actually uses visibility_cutoff_xid even if the page is
just marked all-visible, but not all-frozen (due to the do_set_vm check
coming earlier).
> This is true even if there are LP_DEAD line pointers
> + * because we ignored those when maintaining the visibility_cutoff_xid.
I must just be missing something, because I can't follow this at all. I guess
it could be correct because we later add in knowledge of removed xids via
the TransactionIdFollows check below? But if that's it, this is extremely
confusingly worded.
Sorry, running out of brain power. More another day.
Greetings,
Andres Freund