| From: | Melanie Plageman <melanieplageman(at)gmail(dot)com> |
|---|---|
| To: | Andres Freund <andres(at)anarazel(dot)de> |
| Cc: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Kirill Reshke <reshkekirill(at)gmail(dot)com>, Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>, Xuneng Zhou <xunengzhou(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
| Subject: | Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access) |
| Date: | 2026-01-28 23:16:10 |
| Message-ID: | CAAKRu_bs+gZ83QDacmBxunPvCGnXJ05hxP2BDPJ3BGwdbGRXzg@mail.gmail.com |
| Lists: | pgsql-hackers |
Thanks for the review!
I pushed v33 0001-0003 after incorporating your feedback.
On Fri, Jan 23, 2026 at 7:28 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> On 2026-01-06 12:31:57 -0500, Melanie Plageman wrote:
>
> > + */
> > + if ((old_vmbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
> > + {
> > + vacrel->vm_new_visible_pages++;
> > + if (presult.all_frozen)
> > + {
> > + vacrel->vm_new_visible_frozen_pages++;
> > + *vm_page_frozen = true;
>
> Not this patch's fault, but I find "vm_new_visible_pages" and
> "vm_new_visible_frozen_pages" pretty odd names. The concept is all-visible and
> frozen. The page itself isn't visible or invisible...
I thought having the extra word "all" in there made the names too
long. And since "vm" is in the name, the bit isn't set unless the page
is _all_-visible/all-frozen. But if you think it gives people the wrong
idea, I am willing to change it. I can omit "vm" and make it:
new_all_visible_all_frozen_pages
new_all_visible_pages
new_all_frozen_pages
Is that clearer?
> > From 5c65e73246b4968ddfa9d3739f53d0d8734b8727 Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
> > Date: Tue, 2 Dec 2025 15:07:42 -0500
> > Subject: [PATCH v33 04/16] Set the VM in heap_page_prune_and_freeze()
> >
> > This has no independent benefit. It is meant for ease of review. As of
> > this commit, there is still a separate WAL record emitted for setting
> > the VM after pruning and freezing. But it is easier to review if moving
> > the logic into pruneheap.c is separate from setting the VM in the same
> > WAL record.
>
> It seems a bit noisy to refactor the related code in some of the preceding
> commits and then refactor it into a slightly different shape as part of this
> commit (cf. heap_page_will_set_vm()).
I understand what you are saying. I don't see a good way to keep it
reviewable otherwise, though.
> It's also a bit odd that a function that sounds rather read-only does stuff
> like clearing VM/all-visible.
I thought about this a lot. Ultimately, I ended up keeping it the way it is.
I think the other option is changing from this:
do_set_vm = heap_page_will_set_vm(&prstate,
                                  params->relation,
                                  blockno, buffer, page,
                                  vmbuffer,
                                  params->reason,
                                  do_prune, do_freeze,
                                  prstate.lpdead_items,
                                  &old_vmbits, &new_vmbits);
to this:
heap_page_prepare_vm_set(&prstate,
                         params->relation,
                         blockno, buffer, page,
                         vmbuffer,
                         params->reason,
                         do_prune, do_freeze,
                         prstate.lpdead_items,
                         &old_vmbits, &new_vmbits);
do_set_vm = (new_vmbits & VISIBILITYMAP_VALID_BITS) != 0;
(or the same thing with the name heap_page_plan_vm_set())
heap_page_will_set_vm() has symmetry with heap_page_will_freeze(), the
helper that decides whether or not we will freeze tuples. I like that
symmetry since heap_page_will_set_vm() decides whether or not to set
the VM.
Now, heap_page_plan/prepare_vm_set() does indirectly hint that
something like clearing VM/all-visible could happen -- if you
understand that preparing the VM to have bits set also includes
clearing any existing corruption. And "prepare" or "plan" has more
symmetry with prune_freeze_plan() -- though that function does not
make changes on the page.
Ultimately, clearing corruption from the VM/page is pretty anomalous
compared with the rest of the code in heap_page_prune_and_freeze():
all other changes to the page are made in a single critical section at
the bottom of the function.
I could see an argument for moving identify_and_fix_vm_corruption()
out of the helper and into heap_page_prune_and_freeze() but then we'd
have to move visibilitymap_get_status() out too. And that takes away a
lot of the benefit of encapsulating all that logic.
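To be concrete, that alternative would make heap_page_prune_and_freeze()
look roughly like this (just a sketch of the shape, not actual patch
code; the argument lists are elided):

/* Sketch of the alternative: read the VM and clear any corruption in
 * the caller, so the helper itself becomes genuinely read-only. */
old_vmbits = visibilitymap_get_status(relation, blockno, &vmbuffer);
identify_and_fix_vm_corruption(...);    /* may clear VM bits / PD_ALL_VISIBLE */
do_set_vm = heap_page_will_set_vm(...); /* now read-only */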
In the end, I don't like any of those alternative structures. But if
you prefer the names and return-value change above
(heap_page_prepare/plan_vm_set()), I'm fine with going with that.
> Why are we not fixing up the page *before* we prune it? It's a bit
> insane that we do the WAL logging for pruning, which in turn will often
> include an FPI, before we do the fixups. The fixes aren't WAL logged, so this
> actually leads to the standby getting further out of sync.
>
> I realize this isn't your mess, but brrr.
Well, after this patch set, clearing the VM does happen before we emit
WAL for pruning. It wouldn't be hard to move the corruption fixups to
the beginning of heap_page_prune_and_freeze() in the new code
structure. But it would split visibility map-related logic into two
parts of heap_page_prune_and_freeze(). Would it be worth it? What
benefit would we get? Do you just feel that it should logically come
first?
> Do we actually foresee a case where only one of HEAP_PAGE_PRUNE_FREEZE |
> HEAP_PAGE_PRUNE_UPDATE_VM would be set?
Yes, when setting the VM on-access, it is too expensive to call
heap_prepare_freeze_tuple() on each tuple. I could work on trying to
optimize it, but it isn't currently viable.
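To make the intended split concrete, the flag usage would be roughly
the following (sketch only; the call sites are illustrative, not actual
patch code):

/* vacuum asks for both pruning/freezing and the VM update */
options = HEAP_PAGE_PRUNE_FREEZE | HEAP_PAGE_PRUNE_UPDATE_VM;

/* the on-access path skips the per-tuple heap_prepare_freeze_tuple()
 * work, so it only asks for the VM update */
options = HEAP_PAGE_PRUNE_UPDATE_VM;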
> > - if ((old_vmbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
> > + if ((presult.old_vmbits & VISIBILITYMAP_ALL_VISIBLE) == 0 &&
> > + (presult.new_vmbits & VISIBILITYMAP_ALL_VISIBLE) != 0)
> > {
> > vacrel->vm_new_visible_pages++;
> > - if (presult.all_frozen)
> > + if ((presult.new_vmbits & VISIBILITYMAP_ALL_FROZEN) != 0)
> > {
> > vacrel->vm_new_visible_frozen_pages++;
> > *vm_page_frozen = true;
> > }
> > }
> > - else if ((old_vmbits & VISIBILITYMAP_ALL_FROZEN) == 0 &&
> > - presult.all_frozen)
> > + else if ((presult.old_vmbits & VISIBILITYMAP_ALL_FROZEN) == 0 &&
> > + (presult.new_vmbits & VISIBILITYMAP_ALL_FROZEN) != 0)
> > {
> > + Assert((presult.new_vmbits & VISIBILITYMAP_ALL_VISIBLE) != 0);
> > vacrel->vm_new_frozen_pages++;
> > *vm_page_frozen = true;
> > }
>
> It's a bit odd that we figure out all of this by inspecting old/new vmbits and
> have that logic in multiple places.
I changed PruneFreezeResult to have just the counters that have to be
reflected in vacrel for logging (vm_new_frozen_pages, etc) instead of
passing the bits back.
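So the vacuum side now just accumulates them, roughly like this (a
sketch of the intent, not the exact committed code):

/* heap_page_prune_and_freeze() reports how the page's VM state changed;
 * vacuum only has to add the counters into vacrel for logging instead
 * of re-deriving the transitions from old/new vmbits. */
vacrel->vm_new_visible_pages += presult.vm_new_visible_pages;
vacrel->vm_new_visible_frozen_pages += presult.vm_new_visible_frozen_pages;
vacrel->vm_new_frozen_pages += presult.vm_new_frozen_pages;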
> > Subject: [PATCH v33 05/16] Move VM assert into prune/freeze code
>
> Feels like a somewhat too narrow description, given that it changes the API
> for heap_page_prune_and_freeze() by removing variables from PruneFreezeResult.
I've tried to fix this.
> > +/*
> > + * Wrapper for heap_page_would_be_all_visible() which can be used for callers
> > + * that expect no LP_DEAD on the page. Currently assert-only, but there is no
> > + * reason not to use it outside of asserts.
> > + */
>
> If so, why would we want it in pruneheap.c? Seems a bit odd to have
> heap_page_would_be_all_visible() defined in vacuumlazy.c but
> heap_page_is_all_visible() defined in pruneheap.c.
You're right. I've kept them both in vacuumlazy.c.
> > From cdf5776fadeae3430c692999b37f8a7ec944bda1 Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
> > Date: Tue, 2 Dec 2025 16:16:22 -0500
> > +static TransactionId
> > +get_conflict_xid(bool do_prune, bool do_freeze, bool do_set_vm,
> > + uint8 old_vmbits, uint8 new_vmbits,
> > + TransactionId latest_xid_removed, TransactionId frz_conflict_horizon,
> > + TransactionId visibility_cutoff_xid)
> > +{
>
> The logic for horizons is now split between this and "Calculate what the
> snapshot conflict horizon should be for a record" in heap_page_will_freeze().
That is true in master too. We determine frz_conflict_horizon in
heap_page_will_freeze(), and later, before emitting the WAL record, we
decide which of latest_xid_removed and frz_conflict_horizon to use as
the snapshot conflict horizon for the combined record.
All I've done is expand that part (the part before emitting the WAL
record) a bit because now we have to consider what the horizon would
be if we set the VM.
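For reference, the shape of that pre-existing decision is roughly the
following (approximate, from memory, not the exact code):

/* Use whichever horizon is newer as the snapshot conflict horizon for
 * the combined prune/freeze record. */
conflict_xid = prstate.latest_xid_removed;
if (do_freeze && TransactionIdFollows(frz_conflict_horizon, conflict_xid))
    conflict_xid = frz_conflict_horizon;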
If I really wanted to calculate it only in a single place, I could
maintain a new variable, all_frozen_except_dead, and remove the
frz_conflict_horizon from heap_page_will_freeze(). Then, in
get_conflict_xid(), I could have the following logic:
if (do_set_vm)
    conflict_xid = visibility_cutoff_xid;
else if (do_freeze)
{
    if (all_frozen_except_dead)
        conflict_xid = visibility_cutoff_xid;
    else
    {
        conflict_xid = OldestXmin;
        TransactionIdRetreat(conflict_xid);
    }
}
else
    conflict_xid = InvalidTransactionId;
I think using all_frozen_except_dead while maintaining
visibility_cutoff_xid (in heap_prune_record_unchanged_lp_normal()) has
the potential to be confusing, though. We'd need to keep updating
visibility_cutoff_xid when all_visible is false but
all_frozen_except_dead is true, as well as when all_visible is true.
And because we don't care about all_visible_except_dead, it gets even
more confusing to make sure we are maintaining the right variables in
the right situations.
Alternatively, we could keep maintenance of visibility_cutoff_xid the
same and only use all_frozen_except_dead to avoid having the conflict
xid calculation in two places. We would just set it after
prune_freeze_plan() and use it the way it is in the snippet above. But
I don't know if this is better than just having a separate freeze
conflict horizon calculated in the will_freeze code. It is just as
confusing to understand, with just as many variables, only in a
different place.
For now, I've kept it as is.
> Although I guess I don't understand that code:
>
> /*
> * Calculate what the snapshot conflict horizon should be for a record
> * freezing tuples. We can use the visibility_cutoff_xid as our cutoff
> * for conflicts when the whole page is eligible to become all-frozen
> * in the VM once we're done with it. Otherwise, we generate a
> * conservative cutoff by stepping back from OldestXmin.
> */
> if (prstate->all_frozen)
> prstate->frz_conflict_horizon = prstate->visibility_cutoff_xid;
> else
> {
> /* Avoids false conflicts when hot_standby_feedback in use */
> prstate->frz_conflict_horizon = prstate->cutoffs->OldestXmin;
> TransactionIdRetreat(prstate->frz_conflict_horizon);
> }
>
> Why does it make sense to use OldestXmin? Consider e.g. the case where there
> is one very old tuple that needs to be frozen and one new live tuple on a
> page. Because of the new tuple we can't mark the page all-frozen. But there's
> also no reason not to use a much less aggressive horizon than OldestXmin, namely
> the newer of the xmin/xmax of the old frozen tuple?
We don't track the newest frozen xmin right now. Doing so wouldn't be
free (i.e., more comparisons, which may matter in a query without much
other overhead). And we can't get rid of any of the other cutoffs we
track: we'd still need to maintain visibility_cutoff_xid and
latest_removed_xid, which also means it doesn't simplify the code.
The only purpose it would serve is to make the snapshot conflict
horizon more accurate/more aggressive when we freeze tuples, which
would lead to canceling fewer queries than master -- and that is
outside the purview of this patch. There's also a set of complications,
mentioned by Peter in [1], around maintaining this number accurately.
I've added a new patch to the series (0001) that expands the comment
about this in heap_page_will_freeze() and explains that we are using a
coarse cutoff because we don't track anything more precise.
But I don't think changing this behavior is a blocker for this feature.
> I also don't understand what the "false conflicts" thing is referencing.
Yea, neither do I. I ported that comment over (it's from before I
started modifying this code) and have never really understood what it
meant -- wouldn't we have more false conflicts if we use a newer
cutoff? (because OldestXmin will be newer than visibility_cutoff_xid).
If the page isn't all-visible, we won't maintain
visibility_cutoff_xid. But if we did actually track the newest live
tuple xmin on the page, that could very well be newer than OldestXmin
and thus using OldestXmin would cancel less and avoid more false
conflicts. But that feels like a _very_ big stretch (in a hypothetical
world that doesn't exist). Anyway, I've deleted the comment (in 0001)
since it clearly is not adding value.
> > + TransactionId conflict_xid;
> > +
> > + /*
> > + * We can omit the snapshot conflict horizon if we are not pruning or
> > + * freezing any tuples and are setting an already all-visible page
> > + * all-frozen in the VM. In this case, all of the tuples on the page must
> > + * already be visible to all MVCC snapshots on the standby.
> > + */
>
> The last sentence here is a bit confusing, because they don't just need to
> already be visible to everyone, they already need to be frozen. Right?
Right. Well, that also means that all tuples are visible to all MVCC
snapshots on the standby -- I picked that language up from
somewhere else in the code and thought it sounded good. But I've
edited it to say frozen, which is more accurate.
> > + if (!do_prune &&
> > + !do_freeze &&
> > + do_set_vm &&
>
> I'm confused by the do_set_vm check here. Doesn't it mean that we will *not*
> return InvalidTransactionId if !prstate->attempt_update_vm? I don't understand
> why that would make sense.
>
> I guess we'll compute a bogus cutoff in that case, but never use it, since
> we'll also not emit WAL? Or maybe we'll just unnecessarily go through the
> code below, because the code turns out to do ok regardless? It's confusing
> either way.
Ah no, that was a relic from when I didn't clear the new vmbits in the
event that they were the same as old_vmbits. Now that I do that, I
don't need to qualify it with do_set_vm anymore. I've fixed that.
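So the check now reduces to something like this (a sketch of the shape
only, not the exact committed form):

/* With new_vmbits cleared whenever nothing would change, the separate
 * do_set_vm qualifier is redundant. */
if (!do_prune && !do_freeze &&
    (old_vmbits & VISIBILITYMAP_ALL_VISIBLE) != 0 &&
    (new_vmbits & VISIBILITYMAP_ALL_FROZEN) != 0)
    return InvalidTransactionId;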
> > + (old_vmbits & VISIBILITYMAP_ALL_VISIBLE) != 0 &&
> > + (new_vmbits & VISIBILITYMAP_ALL_FROZEN) != 0)
>
> I wonder if some of this code would end up cleaner if we tracked the bits we
> intend to add, rather than what the target set of bits is.
I played around with this idea. Unfortunately, this specific check
wouldn't be much simpler, as I'd have to check that _only_ the
all-frozen bit is set. And I think it is probably clearer with
explicit old and new vmbits. Those make it clear that we are setting a
formerly all-visible page all-frozen.
In some of the other places where I use old/new vmbits, I tried only
using the delta/bits that need to be newly set. But, for example, in
visibilitymap_set(), I need both the all-visible and all-frozen bits
to be passed (even if all-visible is already set), because it asserts
that if all-frozen is passed, all-visible must be passed as well. If I
only kept track of the new bits that needed to be set, I would need
extra logic before visibilitymap_set(). And it makes
sense that visibilitymap_set() requires both because you don't ever
want people setting just all-frozen -- and you don't want the API to
make it easy to do that.
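In other words, the invariant the API enforces is effectively this
(illustrative only, not the literal assertion in visibilitymap.c):

/* all-frozen may only ever be set together with all-visible */
Assert((flags & VISIBILITYMAP_ALL_FROZEN) == 0 ||
       (flags & VISIBILITYMAP_ALL_VISIBLE) != 0);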
> > + /*
> > + * If we are updating the VM, the conflict horizon is almost always the
> > + * visibility cutoff XID.
> > + *
> > + * Separately, if we are freezing any tuples, as an optimization, we can
> > + * use the visibility_cutoff_xid as the conflict horizon if the page will
> > + * be all-frozen.
>
> What does "as an optimization" mean here?
What I meant was that visibility_cutoff_xid is going to be older than
OldestXmin for an all-frozen page, so it will lead to canceling fewer
queries. So, using it is kind of an "optimization". But I re-read that
comment and it was way too confusing. I actually cut that whole
paragraph because it should be discussed in heap_page_will_freeze()
where we are actually handling it.
> Note that the code actually uses visibility_cutoff_xid even if the page is
> just marked all-visible, but not all-frozen (due to the do_set_vm check being
> earlier)
And that is correct. But the comment (which is now removed) was
misleading, you are right.
> > This is true even if there are LP_DEAD line pointers
> > + * because we ignored those when maintaining the visibility_cutoff_xid.
>
> I must just be missing something because I can't follow this at all. I guess
> it could be correct because we later add in knowledge of removed xids
> via the TransactionIdFollows check below? But if that's it, this is extremely
> confusingly worded.
Yes, we add in latest_xid_removed, which is what makes it correct. But
I agree that the wording was terrible. I've cut those details from
get_conflict_xid() and kept them where they are relevant in
heap_page_will_freeze().
- Melanie