From: | Melanie Plageman <melanieplageman(at)gmail(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Kirill Reshke <reshkekirill(at)gmail(dot)com>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Subject: | Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access) |
Date: | 2025-09-18 00:10:07 |
Message-ID: | CAAKRu_YOJ3VTKo4Z9vB2hGeTnwVWsL39gXH09vyBUQ7bGtDnKA@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Wed, Sep 10, 2025 at 4:01 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Tue, Sep 9, 2025 at 7:08 PM Melanie Plageman
> <melanieplageman(at)gmail(dot)com> wrote:
> > Fair. I've introduced new XLHP flags in attached v13. Hopefully it
> > puts an end to the horror.
>
> I suggest not renumbering all of the existing flags and just adding
> these new ones at the end. Less code churn and more likely to break in
> an obvious way if you mix up the two sets of flags.
Makes sense. In my attached v14, I have not renumbered them.
> More on 0002:
After an off-list discussion we had about how to make the patches in
the set progressively improve the code instead of just mechanically
refactoring it, I have made some big changes in the intermediate
patches in the set.
Before actually including the VM changes in the vacuum/prune WAL
records, I first include setting PD_ALL_VISIBLE with the other changes
to the heap page so that we can remove the heap page from the VM
setting WAL chain. This happens to fix the bug we discussed where if
you set an all-visible page all-frozen and checksums/wal_log_hints are
enabled, you may end up setting an LSN on a page that was not marked
dirty.
0001 is RFC but waiting on one other reviewer
0002 - 0007 is a bit of cleanup I had later in the patch set but moved
up because I think it made the intermediate patches better
0008 - 0012 removes the heap page from the XLOG_HEAP2_VISIBLE WAL
chain (it makes all callers of visibilitymap_set() set PD_ALL_VISIBLE
in the same WAL record as changes to the heap page)
0013 - 0018 finish the job eliminating XLOG_HEAP2_VISIBLE and set VM
bits in the same WAL record as the heap changes
0019 - 0024 set the VM on-access
> /*
> + * Note that the heap relation may have been dropped or truncated, leading
> + * us to skip updating the heap block due to the LSN interlock. However,
> + * even in that case, it's still safe to update the visibility map. Any
> + * WAL record that clears the visibility map bit does so before checking
> + * the page LSN, so any bits that need to be cleared will still be
> + * cleared.
> + *
> + * Note that the lock on the heap page was dropped above. In normal
> + * operation this would never be safe because a concurrent query could
> + * modify the heap page and clear PD_ALL_VISIBLE -- violating the
> + * invariant that PD_ALL_VISIBLE must be set if the corresponding bit in
> + * the VM is set.
> + *
> + * In recovery, we expect no other writers, so writing to the VM page
> + * without holding a lock on the heap page is considered safe enough. It
> + * is done this way when replaying xl_heap_visible records (see
> */
>
> How many copies of this comment do you plan to end up with?
By the end, one for copy freeze replay and one for prune/freeze/vacuum
replay. I felt two wasn't too bad and was easier than meta-explaining
what the other comment was explaining.
> > > 0004. It is not clear to me why you need to get
> > > log_heap_prune_and_freeze to do the work here. Why can't
> > > log_newpage_buffer get the job done already?
> >
> > Well, I need something to emit the changes to the VM. I'm eliminating
> > all users of xl_heap_visible. Empty pages are the ones that benefit
> > the least from switching from xl_heap_visible -> xl_heap_prune. But,
> > if I don't transition them, we have to maintain all the
> > xl_heap_visible code (including visibilitymap_set() in its long form).
> >
> > As for log_newpage_buffer(), I could keep it if you think it is too
> > confusing to change log_heap_prune_and_freeze()'s API (by passing
> > force_heap_fpi) to handle this case, I can leave log_newpage_buffer()
> > there and then call log_heap_prune_and_freeze().
> >
> > I just thought it seemed simple to avoid emitting the new page record
> > and the VM update record, so why not -- but I don't have strong
> > feelings.
>
> Yeah, I'm not sure what the right thing to do here is. I think I was
> again experiencing brain fade by forgetting that there is a heap page
> and a VM page and, of course, log_heap_newpage() probably isn't going
> to touch the latter. So that makes sense. On the other hand, we could
> only have one type of WAL record for every single operation in the
> system if we gave it enough flags, and force_heap_fpi seems
> suspiciously like a flag that turns this into a whole different kind
> of WAL record.
I've kept log_heap_newpage() and used log_heap_prune_and_freeze() for
setting PD_ALL_VISIBLE and the VM.
> > > 0005. It looks a little curious that you delete the
> > > identify-corruption logic from the end of the if-nest and add it to
> > > the beginning. Ceteris paribus, you'd expect that to be worse, since
> > > corruption is a rare case.
> >
> > On master, the two corruption cases are sandwiched between the normal
> > VM set cases. And I actually think doing it this way is brittle. If
> > you put the cases which set the VM first, you have to have completely
> > bulletproof the if statements guarding them to foreclose any possible
> > corruption case from entering because otherwise you will overwrite the
> > corruption you then try to detect.
>
> Hmm. In the current code, we first test (!all_visible_according_to_vm
> && presult.all_visible), then (all_visible_according_to_vm &&
> !PageIsAllVisible(page) && visibilitymap_get_status(vacrel->rel,
> blkno, &vmbuffer) != 0), and then (presult.lpdead_items > 0 &&
> PageIsAllVisible(page)). The first and second can never coexist,
> because they require opposite values of all_visible_according_to_vm.
> The second and third cannot coexist because they require opposite
> values of PageIsAllVisible(page). It is not entirely obvious that the
> first and third tests couldn't both pass, but you'd have to have
> presult.all_visible and presult.lpdead_items > 0, and it's a bit hard
> to see how heap_page_prune_and_freeze() could ever allow that.
> Consider:
>
> if (prstate.all_visible && prstate.lpdead_items == 0)
> {
> presult->all_visible = prstate.all_visible;
> presult->all_frozen = prstate.all_frozen;
> }
> else
> {
> presult->all_visible = false;
> presult->all_frozen = false;
> }
> ...
> presult->lpdead_items = prstate.lpdead_items;
>
> So I don't really think I'm persuaded that the current way is brittle.
I meant brittle because it has to be so carefully coded for it to work
out this way. If you ever wanted to change or enhance it, it's quite
hard to know how to make sure all of them are entirely mutually
exclusive.
> But that having been said, I agree with you that the order of the
> checks is kind of random, and I don't think it really matters that
> much for performance. What does matter is clarity. I feel like what
> I'd ideally like this logic to do is say: do we want the VM bit for
> the page to be set to all-frozen, just all-visible, or neither? Then
> push the VM bit to the correct state, dragging the page-level bit
> along behind. And the current logic sort of does that. It's roughly:
>
> 1. Should we go from not-all-visible to either all-visible or
> all-frozen? If yes, do so.
> 2. Should we go from either all-visible or all-frozen to
> not-all-visible? If yes, do so.
> 3. Should we go from either all-visible or all-frozen to
> not-all-visible for a different reason? If yes, do so.
> 4. Should we go from all-visible to all-frozen? If yes, do so.
I don't necessarily agree that fixing corruption and setting the VM
should be together -- they feel like separate things to me. But, I
don't feel strongly enough about it to push it.
> But what's weird is that all the tests are written differently, and we
> have two different reasons for going to not-all-visible, namely
> PD_ALL_VISIBLE-not-set and dead-items-on-page, whereas there's only
> one test for each of the other state-transitions, because the
> decision-making for those cases is fully completed at an earlier
> stage. I would kind of like to see this expressed in a way that first
> decides which state transition to make (forward-to-all-frozen,
> forward-to-all-visible, backward-to-all-visible,
> backward-to-not-all-visible, nothing) and then does the corresponding
> work. What you're doing instead is splitting half of those functions
> off into a helper function while keeping the other half where they are
> without cleaning up any of the logic. Now, maybe that's OK: I'm far
> from having grokked the whole patch set. But it is not any more clear
> than what we have now, IMHO, and perhaps even a bit less so.
In terms of my patch set, I do have to change something about this
mixture of fixing corruption and setting the VM because I need to set
the VM bits in the same critical section as making the other changes
to the heap page (pruning, etc) and include the VM set changes in the
same WAL record (note that clearing the VM to fix corruption is not
WAL-logged).
What I've gone with is determining what to set the VM bits to and then
fixing the corruption at the same time. Then, later, when making the
changes to the heap page, I actually set the VM. This is kind of the
opposite of what you suggested above -- determining what to set the
bits to altogether -- corruption and non-corruption cases together. I
don't think we can do that though, because fixing the corruption is
non WAL-logged changes to the page and VM and setting the VM bits is a
WAL-logged change. And, you can't clear bits with visibilitymap_set()
(there's an assertion about that). So you have to call different
functions (not to mention emit distinct error messages). I don't know
that I've come up with the ideal solution, though.
- Melanie
From | Date | Subject | |
---|---|---|---|
Next Message | Chao Li | 2025-09-18 00:11:08 | Re: Incorrect logic in XLogNeedsFlush() |
Previous Message | Sami Imseih | 2025-09-18 00:04:36 | Re: PgStat_HashKey padding issue when passed by reference |