From: | Melanie Plageman <melanieplageman(at)gmail(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Kirill Reshke <reshkekirill(at)gmail(dot)com>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Subject: | Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access) |
Date: | 2025-09-24 17:07:46 |
Message-ID: | CAAKRu_ZiuR+YcUc7=TrgANbRakpzCu8X9zqR=Tf0fE6uDbfP1g@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Thu, Sep 18, 2025 at 12:48 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> On 2025-09-17 20:10:07 -0400, Melanie Plageman wrote:
>
> > + /*
> > + * Now read and update the VM block.
> > + *
> > + * Note that the heap relation may have been dropped or truncated, leading
> > + * us to skip updating the heap block due to the LSN interlock.
>
> I don't fully understand this - how does dropping/truncating the relation lead
> to skipping due to the LSN interlock?
Yes, this wasn't right. I misunderstood.
What I think it should say is that if the heap update was skipped due
to LSN interlock we still have to replay the updates to the VM because
each vm page contains bits for multiple heap blocks and if the record
included a vm page FPI, subsequent updates to the VM may rely on this
FPI to avoid torn pages. We don't condition it on the heap redo having
been an FPI, probably because it is not worth it -- but I wonder if
that is worth calling out in the comment?
Do we also need to replay it when the heap redo returns BLK_NOTFOUND?
I assume this can happen in the case of relation dropped or truncated
-- but in this case there wouldn't be subsequent records updating the
VM for other heap blocks that we need to replay because the other heap
blocks won't be found either, right?
> > + if (xlrec->flags & XLH_INSERT_ALL_FROZEN_SET &&
> > + XLogReadBufferForRedoExtended(record, 1, RBM_ZERO_ON_ERROR, false,
> > + &vmbuffer) == BLK_NEEDS_REDO)
> > + {
>
> Why are we using RBM_ZERO_ON_ERROR here? I know it's copied from
> heap_xlog_visible(), but I don't immediately understand (or remember) why we
> do so there either?
It has been RBM_ZERO_ON_ERROR since XLogReadBufferForRedoExtended()
was introduced here in 2c03216d8311.
I think we probably do this because vm_readbuf() passes ReadBuffer()
RBM_ZERO_ON_ERROR and has this comment
* For reading we use ZERO_ON_ERROR mode, and initialize the page if
* necessary. It's always safe to clear bits, so it's better to clear
* corrupt pages than error out.
Do you think I also should have a comment in heap_xlog_multi_insert()?
> > + Page vmpage = BufferGetPage(vmbuffer);
> > + Relation reln = CreateFakeRelcacheEntry(rlocator);
>
> Hm. Do we really need to continue doing this ugly fake relcache stuff? I'd
> really like to eventually get rid of that and given that the new "code shape"
> delegates a lot more responsibility to the redo routines, they should have a
> fairly easy time not needing a fake relcache? Afaict the relation already is
> not used outside of debugging paths?
Yes, interestingly we don't have the relname in recovery anyway, so it
does all this fake relcache stuff only to convert the relfilenode to a
string and uses that.
The fake relcache stuff will still be used by visibilitymap_pin()
which callers like heap_xlog_delete() use to get the VM page. And I
don't think it is worth coming up with a version of that that doesn't
use the relcache. But you're right that the Relation is not needed for
visibilitymap_set_vmbits(). I've changed it to just take the relation
name as a string.
> > + /* initialize the page if it was read as zeros */
> > + if (PageIsNew(vmpage))
> > + PageInit(vmpage, BLCKSZ, 0);
> > +
> > + visibilitymap_set_vmbits(reln, blkno,
> > + vmbuffer,
> > + VISIBILITYMAP_ALL_VISIBLE |
> > + VISIBILITYMAP_ALL_FROZEN);
> > +
> > + /*
> > + * It is not possible that the VM was already set for this heap page,
> > + * so the vmbuffer must have been modified and marked dirty.
> > + */
>
> I assume that's because we a) checked the LSN interlock b) are replaying
> something that needed to newly set the bit?
Yes, perhaps it is not worth having the assert since it attracts extra
attention to an invariant that is unlikely to be in danger of
regression.
> Seems 0002 should just be applied...
Done
> Re 0003: I wonder if it's getting to the point that a struct should be used as
> the argument.
I have been thinking about this. I have yet to come up with a good
idea for a struct name or multiple struct names that seem to fit here.
I could move the other output parameters into the PruneFreezeResult
and then maybe make some kind of PruneFreezeParameters struct or
something?
- Melanie
From | Date | Subject | |
---|---|---|---|
Next Message | Masahiko Sawada | 2025-09-24 17:11:20 | Re: Add memory_limit_hits to pg_stat_replication_slots |
Previous Message | Nathan Bossart | 2025-09-24 16:52:09 | Re: Clarification on Role Access Rights to Table Indexes |