Re: Show various offset arrays for heap WAL records

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>
Subject: Re: Show various offset arrays for heap WAL records
Date: 2023-04-07 20:33:08
Message-ID: CAAKRu_b2oE4GL=q4g9mcByS9yT7wTQvEH9OLpabj28e+WKFi2A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Attached v3 is cleaned up and includes a pg_walinspect docs update as
well as some edited comments in rmgr_utils.c

On Mon, Mar 27, 2023 at 6:27 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>
> On Mon, Mar 27, 2023 at 2:29 PM Melanie Plageman
> <melanieplageman(at)gmail(dot)com> wrote:
> > I went to add dedup records and noticed that since the actual
> > BTDedupInterval struct is what is put in the xlog, I would need access
> > to that type from nbtdesc.c, however, including nbtree.h doesn't seem to
> > work because it includes files that cannot be included in frontend code.
>
> I suppose that the BTDedupInterval struct could have just as easily
> gone in nbtxlog.h, next to xl_btree_dedup. There might have been a
> moment where I thought about doing it that way, but I guess I found it
> slightly preferable to use that symbol name (BTDedupInterval) rather
> than (say) xl_btree_dedup_interval in places like the nearby
> BTDedupStateData struct.
>
> Actually, I suppose that it's hard to make that alternative work, at
> least without
> including nbtxlog.h in nbtree.h. Which sounds wrong.
>
> > I, of course, could make some local struct in nbtdesc.c which has an
> > OffsetNumber and a uint16, since the BTDedupInterval is pretty
> > straightforward, but that seems a bit annoying.
> > I'm probably missing something obvious, but is there a better way to do
> > this?
>
> It was probably just one of those cases where I settled on the
> arrangement that looked least odd overall. Not a particularly
> principled approach. But the approach that I'm going to take once more
> here. ;-)
>
> All of the available alternatives are annoying in roughly the same
> way, though perhaps to varying degrees. All except one: I'm okay with
> just not adding coverage for deduplication records, for the time being
> -- just seeing the number of intervals alone is relatively informative
> with deduplication records, unlike (say) nbtree delete records. I'm
> also okay with having coverage for dedup records if you feel it's
> worth having. Your call.
>
> If we're going to have coverage for deduplication records then it
> seems to me that we have to have a struct in nbtxlog.h for your code
> to work off of. It also seems likely that we'll want to use that same
> struct within nbtxlog.c. What's less clear is what that means for the
> BTDedupInterval struct. I don't think that we should include nbtxlog.h
> in nbtree.h, nor should we do the converse.
>
> I guess maybe two identical structs would be okay. BTDedupInterval,
> and xl_btree_dedup_interval, with the former still used in nbtdedup.c,
> and the latter used through a pointer at the point that nbtxlog.c
> reads a dedup record. Then maybe at a sizeof() static assert beside
> the existing btree_xlog_dedup() assertions that check that the dedup
> state interval array matches the array taken from the WAL record.
> That's still a bit weird, but I find it preferable to any alternative
> that I can think of.

I've omitted enhancements for the dedup record type for now.

> > On another note, I've thought about how to include some example output
> > in docs, and, for example we could modify the example output in the
> > pgwalinspect docs which includes a PRUNE record already for
> > pg_get_wal_record_info() docs. We'd probably just want to keep it short.
>
> Yeah. Perhaps a PRUNE record for one of the system catalogs whose
> relfilenode is relatively recognizable. Say pg_class. It probably
> doesn't matter that much, but there is perhaps some small value in
> picking an example that is relatively easy to recreate later on (or to
> approximately recreate). I'm certainly not insisting on that, though.

I've added such an example to pg_walinspect docs.

On Tue, Mar 21, 2023 at 6:37 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>
> On Mon, Mar 13, 2023 at 6:41 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> > There are several different things that seem important to me
> > personally. These are in tension with each other, to a degree. These
> > are:
> >
> > 1. Like Andres, I'd really like to have some way of inspecting things
> > like heapam PRUNE, VACUUM, and FREEZE_PAGE records in significant
> > detail. These record types happen to be very important in general, and
> > the ability to see detailed information about the WAL record would
> > definitely help with some debugging scenarios. I've really missed
> > stuff like this while debugging serious issues under time pressure.
>
> One problem that I often run into when performing analysis of VACUUM
> using pg_walinspect is the issue of *who* pruned which heap page, for
> any given PRUNE record. Was it VACUUM/autovacuum, or was it
> opportunistic pruning? There is no way of knowing for sure right now.
> You *cannot* rely on an xid of 0 as an indicator of a given PRUNE
> record coming from VACUUM; it could just have been an opportunistic
> prune operation that happened to take place when a SELECT query ran,
> before any XID was ever allocated.
>
> I think that we should do something like the attached, to completely
> avoid this ambiguity. This patch adds a new XLOG_HEAP2 bit that's
> similar to XLOG_HEAP_INIT_PAGE -- XLOG_HEAP2_BYVACUUM. This allows all
> XLOG_HEAP2 record types to indicate that they took place during
> VACUUM, by XOR'ing the flag with the record type/info when
> XLogInsert() is called. For now this is only used by PRUNE records.
> Tools like pg_walinspect will report a separate "Heap2/PRUNE+BYVACUUM"
> record_type, as well as the unadorned Heap2/PRUNE record_type, which
> we'll now know must have been opportunistic pruning.
>
> The approach of using a bit in the style of the heapam init bit makes
> sense to me, because the bit is available, and works in a way that is
> minimally invasive. Also, one can imagine needing to resolve a similar
> ambiguity in the future, when (say) opportunistic freezing is added.
>
> I think that it makes sense to treat this within the scope of
> Melanie's ongoing work to improve the instrumentation of these records
> -- meaning that it's in scope for Postgres 16. Admittedly this is a
> slightly creative interpretation, so if others disagree then I won't
> argue. This is quite a small patch, though, which makes debugging
> significantly easier. I think that there could be a great deal of
> utility in being able to easily "pair up" corresponding
> "Heap2/PRUNE+BYVACUUM" and "Heap2/VACUUM" records in debugging
> scenarios. I can imagine linking these to "Heap2/FREEZE_PAGE" and
> "Heap2/VISIBLE" records, too, since they're all closely related record
> types.

I really like this idea and would find it useful. I reviewed the patch
and tried it out and it worked for me and code looked fine as well.

I didn't include it in the attached patchset because I don't feel
confident enough in my own understanding of any potential implications
of splitting up these record types to definitively endorse it. But, if
someone else felt comfortable with it, I would like to see it in the
tree.

- Melanie

Attachment Content-Type Size
v1-0001-Record-which-PRUNE-records-are-from-VACUUM.patch text/x-patch 5.2 KB
v3-0001-Add-rmgr_desc-utilities.patch text/x-patch 17.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2023-04-07 20:38:43 Re: cataloguing NOT NULL constraints
Previous Message Daniel Gustafsson 2023-04-07 20:24:04 Re: Making background psql nicer to use in tap tests