Quick Links

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From:	Melanie Plageman <melanieplageman(at)gmail(dot)com>
To:	Andres Freund <andres(at)anarazel(dot)de>
Cc:	Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Kirill Reshke <reshkekirill(at)gmail(dot)com>, Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>, Xuneng Zhou <xunengzhou(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject:	Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)
Date:	2026-03-02 23:38:07
Message-ID:	CAAKRu_Y1MuANdm1p47Ev13Y9EQz8z+pw-vHOh=3DVdahUTjgXg@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Fri, Feb 20, 2026 at 12:59 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> On 2026-01-28 18:16:10 -0500, Melanie Plageman wrote:
>
> > I could see an argument for moving identify_and_fix_vm_corruption()
> > out of the helper and into heap_page_prune_and_freeze() but then we'd
> > have to move visibilitymap_get_status() out too. And that takes away a
> > lot of the benefit of encapsulating all that logic.
>
> I was wondering about that option. Relatedly, I also was wondering if we ought
> to do identify_and_fix_vm_corruption() regardless of ->attempt_update_vm.

Attached v35 does this. I always pin the vmbuffer if we are going to
prune in heap_page_prune_opt(). In many cases, because it's saved in
the scan descriptor, it won't actually need to take a new pin. During
pruning, I check for VM corruption even if I am not considering
setting the VM.

> > Well, after this patch set, clearing the VM does happen before we emit
> > WAL for pruning.
>
> That I think is a substantial improvement, the current (i.e. before your
> series) placement really is pretty insane due to the guaranteed divergence it
> causes.
>
> I wonder if we actually should just force an FPI whenever we detect such
> corruption, that way it would reliably fixed on the standby as well.

Only problem is we would have to do an FPI of the VM page as well if
we wanted the corruption to be reliably fixed on the standby.

> > It wouldn't be hard to move the corruption fixups to the beginning of
> > heap_page_prune_and_freeze() in the new code structure.
>
> As identify_and_fix_vm_corruption() needs lpdead_items, I'm not sure that's
> true?
>
> I wonder if at least the warning for the "(PageIsAllVisible(heap_page) &&
> nlpdead_items > 0)" test should be moved to
> heap_prune_record_dead_or_unused(). That way the WARNING could include the
> offset number and it'd also work in the mark_unused_now case.
>
> Perhaps it also should trigger for RECENTLY_DEAD, INSERT_IN_PROGRESS,
> DELETE_IN_PROGRESS?
>
> At that point the !page_all_visible && vm_all_visible part could indeed be
> moved to the start of heap_page_prune_and_freeze()

I've done all this. There is heap page/VM corruption check at the
beginning of heap_page_prune_and_freeze() and then checking for
corruption during pruning in the previously covered case (lpdead
items) as well as the mark_unused_now case, and
RECENTLY_DEAD/INSERT_IN_PROGRESS/DELETE_IN_PROGRESS.

> > Would it be worth it? What benefit would we get? Do you just feel that it
> > should logically come first?
>
> One insanity is that right now we will process all frozen pages over and over
> due to he skip pages threshold, wasting a *lot* of CPU and memory bandwidth.
> It'd be quite defensible to just skip processing the page once we determined
> it's already all frozen. But for that we'd probably want to do the
> "page_all_visible && vm_all_visible" check before returning...

I've added a fast path to bypass pruning/freezing when the page is
already all-visible. And I check for pg_all_visible && vm_all_visible
beforehand. The one downside this has is if there is a page marked
all-frozen but has dead tuples on it, we'll never get to fix that
corruption nor clean up the dead tuples. But the fast path kind of
seems worth it to me.

> > > Do we actually forsee a case where only one of HEAP_PAGE_PRUNE_FREEZE |
> > > HEAP_PAGE_PRUNE_UPDATE_VM would be set?
> >
> > Yes, when setting the VM on-access, it is too expensive to call
> > heap_prepare_freeze_tuple() on each tuple. I could work on trying to
> > optimize it, but it isn't currently viable.
>
> Is it too expensive to do so even when we already decided to do some pruning?
> I am not surprised it's too expensive when there's not even a dead tuple on
> the page. But I am mildly surprised if it's too expensive to do when we'd WAL
> log anyway?

It's not really possible in the current code structure to only call
heap_prepare_freeze_tuple() when there are at least some prunable
tuples. We go through the line pointers and record them as prunable at
the same time we call heap_prepare_freeze_tuple(), so we won't know
until we've examined all line pointers that there are no prunable
tuples, at which point we will have called heap_prepare_freeze_tuple()
for every tuple.

> > I think using all_frozen_except_dead while maintaining
> > visibility_cutoff_xid (in heap_prune_record_unchanged_lp_normal()) has
> > the potential to be confusing, though. We'd need to keep updating
> > visibility_cutoff_xid when all_visible is false but
> > all_frozen_except_dead is true as well as when all_visible is true.
> > And because we don't care about all_visible_except_dead, it gets even
> > more confusing to make sure we are maintaining the right variables in
> > the right situations.
>
> I suspect we should just track all of the horizons/cutoffs all the time. This
> whole stuff about optimizing out a few conditional assignments complicates the
> code substantially and feels extremely error prone to me.

I've done this in v35. I posted the freeze horizon tracking patch
separately in [1] but it is in v35 as 0004. Tracking the newest live
xid is in 0009. This also always tracks all_visible for all callers
since I unconditionally pass the vmbuffer now. I still don't set the
VM if the query is modifying the relation, though.

> I probably complained about this before, and it's not this patch's fault, but
> PruneState->{all_visible,all_frozen} are imo confusingly named, due to
> sounding like they describe the current state, rather than the possible state
> after pruning. It's not helped by this comment:
>
> * NOTE: all_visible and all_frozen initially don't include LP_DEAD items.
> * That's convenient for heap_page_prune_and_freeze() to use them to
> * decide whether to opportunistically freeze the page or not. The
> * all_visible and all_frozen values ultimately used to set the VM are
> * adjusted to include LP_DEAD items after we determine whether or not to
> * opportunistically freeze.
>
> "all-visible ... are adjusted to include LP_DEAD" ... - just reading that it's
> hard to know what it means.

0003 does the rename.

> The first thing to improve pruning performance that I would do is to introduce
> a fastpath for pages that a) area already frozen b) do not have dead items (if
> we're not freezing). Iterating through HOT chains is far from cheap, and if
> all rows are live, there's not really a point in doing so. This is
> particulary important for VACUUMs where we end up freezing a ton of pages that
> are already frozen, due to the silly skip_pages_threshold thing.

0007 adds a fast path.

> > +static TransactionId
> > +get_conflict_xid(bool do_prune, bool do_freeze, bool do_set_vm,
> > + uint8 old_vmbits, uint8 new_vmbits,
> > + TransactionId latest_xid_removed, TransactionId frz_conflict_horizon,
> > + TransactionId visibility_cutoff_xid)
> > +{
> > + TransactionId conflict_xid;
> > +
> > + /*
> > + * We can omit the snapshot conflict horizon if we are not pruning or
> > + * freezing any tuples and are setting an already all-visible page
> > + * all-frozen in the VM.
>
> Maybe mention when this can happen, because it's not immediately obvious.

I've added this to my TODO. I honestly can't think of a scenario where
it can happen. But I remember spending quite a bit of time thinking
about it on another occasion. The current code (in master) does
specifically account for this scenario, which is why I kept the logic,
but I'm not sure how it can happen.

I made all the other changes to specific comments you mentioned in
your mail but I won't bore you with itemization.

> > if (do_set_vm)
> > conflict_xid = visibility_cutoff_xid;
> > else if (do_freeze)
> > conflict_xid = frz_conflict_horizon;
> > else
> > conflict_xid = InvalidTransactionId;
>
> Could it be worth checking that if (do_set_vm && do_freeze) the
> frz_conflict_horizon won't "violated" by using visibility_cutoff_xid instead?

Yes, as you mentioned off-list, this wasn't right. New code is like this

TransactionId conflict_xid = InvalidTransactionId;
...
if (do_set_vm)
conflict_xid = newest_live_xid;
if (do_freeze && TransactionIdFollows(newest_frozen_xid, conflict_xid))
conflict_xid = newest_frozen_xid;

> > From 8d350868206456f631883a40a955dff480e408d3 Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
> > Date: Wed, 17 Dec 2025 16:51:05 -0500
> > Subject: [PATCH v34 09/14] Use GlobalVisState in vacuum to determine page
> > level visibility
> >
> > [...]
> >
> > Because comparing a transaction ID against GlobalVisState is more
> > expensive than comparing against a single XID, we defer this check until
> > after scanning all tuples on the page.
>
> Curious, is this a precaution or was this a measurable bottleneck?

I did see GlobalVisTestXidMaybeRunning() in a profile I did when it
was still called for every HEAPTUPLE_LIVE tuple in
heap_prune_record_unchanged_lp_normal(), but I don't have the profile
or test case around anymore.

However, since I now unconditionally maintain the newest_live_xid,
moving GlobalVisTestXidMaybeRunning() back into
heap_prune_record_unchanged_lp_normal() wouldn't help us avoid any
work. It would just make the values of prstate.set_all_visible and
prstate.set_all_frozen more accurate sooner. But I don't think it's
worth the extra function call since set_all_frozen and set_all_visible
won't be totally "done" until after we decide whether or not to
opportunistically freeze anyway.

> > @@ -1077,6 +1078,24 @@ heap_page_prune_and_freeze(PruneFreezeParams *params,
> > prune_freeze_plan(RelationGetRelid(params->relation),
> > buffer, &prstate, off_loc);
> >
> > + /*
> > + * After processing all the live tuples on the page, if the newest xmin
> > + * amongst them may be considered running by any snapshot, the page cannot
> > + * be all-visible.
> > + */
> > + if (prstate.all_visible &&
> > + TransactionIdIsNormal(prstate.visibility_cutoff_xid) &&
>
> Any reason to test IsNormal rather than just IsValid()? There should never be
> a reason it's a valid but not "normal" xid, right?

Well the reason I did this was that the existing code in master
tracking visibility_cutoff_xid only advances it if
TransactionIdIsNormal(). I'm a bit confused about it too because it
seems like we would still want to do it for bootstrap mode xids. But I
see PageSetPrunable() only allows normal xids.

> > @@ -1794,28 +1812,15 @@ heap_prune_record_unchanged_lp_normal(Page page, PruneState *prstate, OffsetNumb
> > }
> >
> > /*
> > - * The inserter definitely committed. But is it old enough
> > - * that everyone sees it as committed? A FrozenTransactionId
> > - * is seen as committed to everyone. Otherwise, we check if
> > - * there is a snapshot that considers this xid to still be
> > - * running, and if so, we don't consider the page all-visible.
> > + * The inserter definitely committed. But we don't know if it
> > + * is old enough that everyone sees it as committed. Later,
> > + * after processing all the tuples on the page, we'll check if
> > + * there is any snapshot that still considers the newest xid
> > + * on the page to be running. If so, we don't consider the
> > + * page all-visible.
> > */
> > xmin = HeapTupleHeaderGetXmin(htup);
> >
> > - /*
> > - * For now always use prstate->cutoffs for this test, because
> > - * we only update 'all_visible' and 'all_frozen' when freezing
> > - * is requested. We could use GlobalVisTestIsRemovableXid
> > - * instead, if a non-freezing caller wanted to set the VM bit.
> > - */
> > - Assert(prstate->cutoffs);
> > - if (!TransactionIdPrecedes(xmin, prstate->cutoffs->OldestXmin))
> > - {
> > - prstate->all_visible = false;
> > - prstate->all_frozen = false;
> > - break;
> > - }
> > -
> > /* Track newest xmin on page. */
> > if (TransactionIdFollows(xmin, prstate->visibility_cutoff_xid) &&
> > TransactionIdIsNormal(xmin))
>
> Kinda wonder if this cod eshould be in something like
> heap_prune_record_freezable() or such, rather than be inside
> heap_prune_record_unchanged_lp_normal().

I played around with it, but it all felt a bit awkward. I wrote it
down for a future enhancement idea.

> > Subject: [PATCH v34 10/14] Unset all_visible sooner if not freezing
> >
> > In the prune/freeze path, we currently delay clearing all_visible and
> > all_frozen in the presence of dead items to allow opportunistic
> > freezing.
> >
> > However, if no freezing will be attempted, there’s no need to delay.
> > Clearing the flags earlier avoids extra bookkeeping in
> > heap_prune_record_unchanged_lp_normal(). This currently has no runtime
> > effect because all callers that consider setting the VM also prepare
> > freeze plans, but upcoming changes will allow on-access pruning to set
> > the VM without freezing. The extra bookkeeping was noticeable in a
> > profile of on-access VM setting.
>
> What workload was that?

It was a select * offset all query with a few fat tuples on each page
and none of them prunable. I'm planning on digging up the
case/creating a new one to see if it is reproducible. This was with an
older version of the code that had more conditionals as well. This
commit is actually dropped in v35 because I now always keep
newest_live_xid up-to-date (0009) which means unsetting
set_all_visible sooner has no benefit.

> Theoretically, even if we don't freeze, the page still may be all-visible or
> all frozen after the removal of dead items, no? Practically that won't happen,
> because we don't remove dead items in any of the relevant paths, but from the
> commit message and comments that's not entirely clear.

Yea, it's clearer with the commit dropped.

> > @@ -678,6 +678,12 @@ typedef struct EState
> > * ExecDoInitialPruning() */
> > const char *es_sourceText; /* Source text from QueryDesc */
> >
> > + /*
> > + * RT indexes of relations modified by the query through a
> > + * UPDATE/DELETE/INSERT/MERGE or targeted by a SELECT FOR UPDATE.
> > + */
> > + Bitmapset *es_modified_relids;
> > +
>
> Other EState fields are initialized in CreateExecutorState, this isn't afaict?

Oops, yes. I based it on es_unpruned_relids which wasn't initialized
there either. I've added a commit (0013) to initialize a few EState
fields that weren't initialized in CreateExecutorState() as well.

> Wonder if it's worth adding a crosscheck somewhere, verifying that if a
> relation is modified, it's in es_modified_relids. Otherwise this could very
> well silently get out of date.

Done in v35 (0014).

> Also, there's some overlap between the informtion collected this way, and
> AcquireExecutorLocks(), ScanQueryForLocks(), which determine the needed lock
> modes via rte->rellockmode.

Those are in parser/planner, so it doesn't seem like a good fit. I
populate es_modified_relids in the executor.

I don't know exactly what the overlap would be between RTEs with an
exclusive rellockmode and es_modified_relids. It seems like you could
have RTEs which don't end up getting modified that have a lock level
that would have made you think that they would be modified.

But were you imagining a substitution or a cross-check?

> > From 8205b2d7da0c3ad3cbc5cead336ced677996b37d Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
> > Date: Wed, 3 Dec 2025 15:12:18 -0500
> > Subject: [PATCH v34 12/14] Pass down information on table modification to scan
> > node
>
> Perhaps worth splitting up, so the addition of the 0 flag is separate from the
> the read only hint aspect.

Done.

[1] https://www.postgresql.org/message-id/CAAKRu_bbaUV8OUjAfVa_iALgKnTSfB4gO3jnkfpcFgrxEpSGJQ%40mail.gmail.com

Attachment	Content-Type	Size
v35-0001-Move-commonly-used-context-into-PruneState-and-s.patch	text/x-patch	16.4 KB
v35-0002-Add-PageGetPruneXid-helper.patch	text/x-patch	1.9 KB
v35-0003-Rename-PruneState-all_visible-all_frozen.patch	text/x-patch	13.7 KB
v35-0004-Use-the-newest-to-be-frozen-xid-as-the-conflict-.patch	text/x-patch	8.0 KB
v35-0005-Save-vmbuffer-in-heap-specific-scan-descriptors-.patch	text/x-patch	5.9 KB
v35-0006-Fix-visibility-map-corruption-in-more-cases.patch	text/x-patch	18.3 KB
v35-0007-Add-pruning-fast-path-for-all-visible-and-all-fr.patch	text/x-patch	4.3 KB
v35-0008-Use-GlobalVisState-in-vacuum-to-determine-page-l.patch	text/x-patch	11.4 KB
v35-0009-Keep-newest-live-XID-up-to-date-even-if-page-not.patch	text/x-patch	14.8 KB
v35-0010-Eliminate-XLOG_HEAP2_VISIBLE-from-vacuum-phase-I.patch	text/x-patch	27.3 KB
v35-0011-Eliminate-XLOG_HEAP2_VISIBLE-from-empty-page-vac.patch	text/x-patch	2.6 KB
v35-0012-Remove-XLOG_HEAP2_VISIBLE-entirely.patch	text/x-patch	25.0 KB
v35-0013-Initialize-missing-fields-in-CreateExecutorState.patch	text/x-patch	924 bytes
v35-0014-Track-which-relations-are-modified-by-a-query.patch	text/x-patch	5.4 KB
v35-0015-Make-begin_scan-functions-take-a-flags-argument.patch	text/x-patch	21.2 KB
v35-0016-Pass-down-information-on-table-modification-to-s.patch	text/x-patch	8.0 KB
v35-0017-Allow-on-access-pruning-to-set-pages-all-visible.patch	text/x-patch	9.9 KB
v35-0018-Set-pd_prune_xid-on-insert.patch	text/x-patch	9.3 KB

In response to

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access) at 2026-02-20 17:59:24 from Andres Freund

Responses

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access) at 2026-03-03 07:32:29 from Chao Li

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Chao Li	2026-03-02 23:41:33	Re: doc: Clarify that empty COMMENT string removes the comment
Previous Message	Zsolt Parragi	2026-03-02 23:19:22	Re: amcheck: add index-all-keys-match verification for B-Tree