From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Kirill Reshke <reshkekirill(at)gmail(dot)com>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)
Date: 2025-09-05 22:20:21
Message-ID: CAAKRu_Yz9x0sejBa5ov_LJ5sMOSKM3AeKOFUg+fQpNqyMmxwRA@mail.gmail.com
Lists: pgsql-hackers
Thanks for the review!
On Tue, Sep 2, 2025 at 7:54 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> On 2025-09-02 19:11:01 -0400, Melanie Plageman wrote:
> > From dd98177294011ee93cac122405516abd89f4e393 Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
> > Date: Wed, 27 Aug 2025 08:50:15 -0400
> > Subject: [PATCH v8 01/22] Remove unneeded VM pin from VM replay
I didn't push it yet because I did a new version that actually
eliminates the asserts in heap_multi_insert() before calling
visibilitymap_set() -- since they are redundant with checks inside
visibilitymap_set(). 0001 of attached v9 is what I plan to push,
barring any objections.
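For context, the redundancy is shaped roughly like this (paraphrased, not the literal diff):

    /* in heap_multi_insert(), just before the call -- now removed */
    Assert(PageIsAllVisible(page));
    Assert(visibilitymap_pin_ok(BufferGetBlockNumber(buffer), vmbuffer));

    /*
     * ...because visibilitymap_set() already verifies the same
     * conditions internally before touching the map.
     */
    visibilitymap_set(...);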
> > From 7c5cb3edf89735eaa8bee9ca46111bd6c554720b Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
> > Date: Wed, 27 Aug 2025 10:07:29 -0400
> > Subject: [PATCH v8 02/22] Add assert and log message to visibilitymap_set
I pushed this.
> From an abstraction POV I don't love that heapam now is responsible for
> acquiring and releasing the lock. But that ship already kind of has sailed, as
> heapam.c is already responsible for releasing the vm buffer etc...
>
> I've wondered about splitting the responsibilities up into multiple
> visibilitymap_set_* functions, so that heapam.c wouldn't need to acquire the
> lock and set the LSN. But it's probably not worth it.
Yea, I explored heap wrappers that couple the heap operations related to
setting the VM with the VM updates themselves [1], but the results weren't
appealing. Setting the heap LSN, marking the heap buffer dirty, and so on
happen in different places in different callers, because they happen as
part of the operations that actually render the page all-visible.
And a VM-only helper would literally just acquire and release the lock
and set the LSN on the vm page -- which I don't think is worth it.
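To make that concrete, the entire body of such a helper would be something
like this (hypothetical sketch; the name is invented):

    /*
     * Hypothetical VM-only helper: lock the vm page, stamp it with the
     * record's LSN, and unlock. Hardly worth a function of its own.
     */
    static void
    vm_stamp_lsn(Buffer vmbuffer, XLogRecPtr recptr)
    {
        LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);
        PageSetLSN(BufferGetPage(vmbuffer), recptr);
        LockBuffer(vmbuffer, BUFFER_LOCK_UNLOCK);
    }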
> > + /*
> > + * Now read and update the VM block. Even if we skipped updating the heap
> > + * page due to the file being dropped or truncated later in recovery, it's
> > + * still safe to update the visibility map. Any WAL record that clears
> > + * the visibility map bit does so before checking the page LSN, so any
> > + * bits that need to be cleared will still be cleared.
> > + *
> > + * It is only okay to set the VM bits without holding the heap page lock
> > + * because we can expect no other writers of this page.
> > + */
> > + if (xlrec->flags & XLH_INSERT_ALL_FROZEN_SET &&
> > + XLogReadBufferForRedoExtended(record, 1, RBM_ZERO_ON_ERROR, false,
> > + &vmbuffer) == BLK_NEEDS_REDO)
> > + {
> > + Relation reln = CreateFakeRelcacheEntry(rlocator);
> > +
> > + Assert(visibilitymap_pin_ok(blkno, vmbuffer));
> > + visibilitymap_set_vmbyte(reln, blkno,
> > + vmbuffer,
> > + VISIBILITYMAP_ALL_VISIBLE |
> > + VISIBILITYMAP_ALL_FROZEN);
> > +
> > + /*
> > + * It is not possible that the VM was already set for this heap page,
> > + * so the vmbuffer must have been modified and marked dirty.
> > + */
> > + Assert(BufferIsDirty(vmbuffer));
>
> How about making visibilitymap_set_vmbyte() return whether it needed to do
> something? This seems somewhat indirect...
It does return the state of the previous bits. But, I am specifically
asserting that the buffer is dirty because I am about to set the page
LSN. So I don't just care that changes were made, I care that we
remembered to mark the buffer dirty.
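In other words, the ordering the assert protects is roughly this (sketch
of the caller's shape, using the renamed function from v9):

    visibilitymap_set_vmbits(reln, blkno, vmbuffer,
                             VISIBILITYMAP_ALL_VISIBLE |
                             VISIBILITYMAP_ALL_FROZEN);

    /*
     * The bits cannot already have been set for this page, so the call
     * above must have modified the buffer and marked it dirty -- and
     * only a dirty buffer may be stamped with the record's LSN.
     */
    Assert(BufferIsDirty(vmbuffer));
    PageSetLSN(BufferGetPage(vmbuffer), lsn);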
> I think it might be good to encapsulate this code into a helper in
> visibilitymap.c, there will be more callers in the subsequent patches.
By the end of the patch set, the different callers have different
expectations (some don't necessarily expect the buffer to have been
dirtied), and where they do the various related operations is spread out
depending on the caller. I just couldn't come up with a helper solution I
liked.
That being said, I definitely don't think it's needed for this patch
(logging setting the VM in xl_heap_multi_insert()).
> > +uint8
> > +visibilitymap_set_vmbyte(Relation rel, BlockNumber heapBlk,
> > + Buffer vmBuf, uint8 flags)
>
> Why is it named vmbyte? This actually just sets the two bits corresponding to
> the buffer, not the entire byte. So it seems somewhat misleading to reference
> byte.
Renamed it to visibilitymap_set_vmbits.
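(The new name matches the layout visibilitymap.c already works with, where
each heap block maps to two bits, not a whole byte, of the map --
paraphrasing the existing macros:)

    #define BITS_PER_HEAPBLOCK   2
    #define HEAPBLOCKS_PER_BYTE  (BITS_PER_BYTE / BITS_PER_HEAPBLOCK)

    /* bit offset of a heap block's two bits within its map byte */
    #define HEAPBLK_TO_MAPBIT(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)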
> > Instead of emitting a separate xl_heap_visible record for each page that
> > is rendered all-visible by vacuum's third phase, include the updates to
> > the VM in the already emitted xl_heap_prune record.
>
> Reading through the change I didn't particularly like that there's another
> optional field in xl_heap_prune, as it seemed liked something that should be
> encoded in flags. Of course there aren't enough flag bits available. But
> that made me look at the rest of the record: Uh, what do we use the reason
> field for? As far as I can tell f83d709760d8 added it without introducing any
> users? It doesn't even seem to be set.
Yikes, you are right about the "reason" member. Attached 0002 removes
it, and I'll go ahead and fix it in the back branches too. I can't
fathom how that slipped through the cracks. We do pass the PruneReason
to set the rmgr info bits identifying what type of record it is (i.e.
whether it was emitted by vacuum phase I, phase III, or on-access
pruning), but we don't need or use a separate member. I went back and
tried to figure out what the rationale was, but I couldn't find anything.
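(That is, the reason only needs to pick the record's info bits at insert
time, roughly along these lines:)

    switch (reason)
    {
        case PRUNE_ON_ACCESS:
            info = XLOG_HEAP2_PRUNE_ON_ACCESS;
            break;
        case PRUNE_VACUUM_SCAN:
            info = XLOG_HEAP2_PRUNE_VACUUM_SCAN;
            break;
        case PRUNE_VACUUM_CLEANUP:
            info = XLOG_HEAP2_PRUNE_VACUUM_CLEANUP;
            break;
    }
    recptr = XLogInsert(RM_HEAP2_ID, info);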
As for the VM flags being an optional unaligned member -- in v9, I've
expanded the flags member to a uint16 to make room for the extra
flags. We seem to have survived using up 2 bytes this long anyway.
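(A sketch of what the record header might look like after these changes,
not a literal excerpt from v9 -- dropping the unused uint8 reason while
widening flags to uint16 keeps the struct the same size:)

    typedef struct xl_heap_prune
    {
        uint16      flags;  /* widened; extra bits carry the VM flags */

        /*
         * OPTIONAL: snapshot conflict horizon, plus the redirected / dead /
         * unused arrays and freeze plans, follow depending on flags.
         */
    } xl_heap_prune;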
> > @@ -51,10 +52,15 @@ heap_xlog_prune_freeze(XLogReaderState *record)
> > (xlrec.flags & (XLHP_HAS_REDIRECTIONS | XLHP_HAS_DEAD_ITEMS)) == 0);
> >
> > /*
> > - * We are about to remove and/or freeze tuples. In Hot Standby mode,
> > - * ensure that there are no queries running for which the removed tuples
> > - * are still visible or which still consider the frozen xids as running.
> > - * The conflict horizon XID comes after xl_heap_prune.
> > + * After xl_heap_prune is the optional snapshot conflict horizon.
> > + *
> > + * In Hot Standby mode, we must ensure that there are no running queries
> > + * which would conflict with the changes in this record. If pruning, that
> > + * means we cannot remove tuples still visible to transactions on the
> > + * standby. If freezing, that means we cannot freeze tuples with xids that
> > + * are still considered running on the standby. And for setting the VM, we
> > + * cannot do so if the page isn't all-visible to all transactions on the
> > + * standby.
> > */
>
> I'm a bit confused by this new comment - it sounds like we're deciding whether
> to remove tuple versions, but that decision has long been made, no?
Well, the comment is a revision of a comment that was already there,
explaining essentially why replaying this record could cause recovery
conflicts. It mentioned pruning and freezing, so I expanded it to mention
setting the VM. Taking your confusion into account, I tried rewording it
in attached v9.
> > + if (heap_page_is_all_visible_except_lpdead(vacrel->rel, buffer,
> > + vacrel->cutoffs.OldestXmin,
> > + deadoffsets, num_offsets,
> > + &all_frozen, &visibility_cutoff_xid,
> > + &vacrel->offnum))
>
> I am rather confused - we never can set all-visible if there are any LP_DEAD
> items left. If the idea is that we are removing the LP_DEAD items in
> lazy_vacuum_heap_page() - what guarantees that all LP_DEAD items are being
> removed? Couldn't some tuples get marked LP_DEAD by on-access pruning, after
> vacuum visited the page and collected dead items?
>
> Ugh, I see - it works because we pass in the set of dead items. I think that
> makes the name *really* misleading, it's not except LP_DEAD, it's except the
> offsets passed in, no?
>
> But then you actually check that the set of dead items didn't change - what
> guarantees that?
So, I pass in the deadoffsets we got from the TIDStore. If the only
dead items on the page are exactly those offsets, then the page
will be all-visible as soon as we set them LP_UNUSED -- which we do
unconditionally. And we have the lock on the page, so no one can
on-access prune and create new dead items while we are in
lazy_vacuum_heap_page().
Given your confusion, I've refactored this and taken a different
approach -- when I encounter a dead item, I explicitly check whether it
is in the passed-in deadoffsets array. Hopefully that makes it clearer.
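Roughly, the reworked check does something like this (sketch only;
cmp_offsetnumber is a hypothetical comparator and deadoffsets is assumed
sorted):

    for (offnum = FirstOffsetNumber;
         offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
        ItemId      itemid = PageGetItemId(page, offnum);

        if (ItemIdIsDead(itemid))
        {
            /*
             * A dead item is only acceptable if vacuum is about to set
             * it LP_UNUSED, i.e. it is one of the passed-in offsets.
             */
            if (!bsearch(&offnum, deadoffsets, num_offsets,
                         sizeof(OffsetNumber), cmp_offsetnumber))
                return false;   /* some other dead item; not all-visible */
            continue;
        }
        /* ... visibility checks for normal items ... */
    }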
> I didn't look at the later patches, except that I did notice this:
<--snip-->
> Why are we manually pinning the vm buffer here? Shouldn't the xlog machinery
> have done so, as you noticed in one of the early on patches?
Fixed. Thanks!
- Melanie