Re: Show various offset arrays for heap WAL records

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Show various offset arrays for heap WAL records
Date: 2023-01-31 21:52:14
Message-ID: CAH2-WznyCmZ8k4=2jJ3B7cGkjn1pDgJCv4T6qXrFh+kCO8L+iw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Jan 27, 2023 at 9:24 AM Melanie Plageman
<melanieplageman(at)gmail(dot)com> wrote:
> I have taken a stab at doing some of the tasks listed in this email.

Cool.

> I have made the new files rmgr_utils.c/h.
>
> I have come up with a standard format that I like for the output and
> used it in all the heap record types.
>
> Examples below:

That seems like a reasonable approach.

> I started documenting it in the rmgr_utils.h header file in a comment,
> however it may be worth a README?
>
> I haven't polished this description of the format (or added examples,
> etc) or used it in the btree-related functions because I assume the
> format and helper function API will need more discussion.

I think that standardization is good, but ISTM that we need clarity on
what the scope is -- what is *not* being standardized? It may or may
not be useful to call the end result an API. Or it may not make sense
to do so in the first committed version, even though we may ultimately
end up as something that deserves to be called an API. The obligation
to not break tools that are scraping the output in whatever way seems
kind of onerous right now -- just not having any gratuitous
inconsistencies (e.g., fixing totally inconsistent punctuation, making
the names for fields across WAL records consistent when they serve
exactly the same purpose) would be a big improvement.

As I mentioned in passing already, I actually don't think that the
B-Tree WAL records are all that special, as far as this stuff goes.
For example, the DELETE Btree record type is very similar to heapam's
PRUNE record type, and practically identical to Btree's VACUUM record
type. All of these record types use the same basic conventions, like a
snapshotConflictHorizon field for recovery conflicts (which is
generated in a very similar way during original execution, and
processed in precisely the same way during REDO), and arrays of page
offset numbers sorted in ascending order.

There are some remaining details where things from an index AM WAL
record aren't directly analogous (or pretty much identical) to some
other heapam WAL records, such as the way that the DELETE Btree record
type deals with deleting a subset of TIDs from a posting list index
tuple (generated by B-Tree deduplication). But even these exceptions
don't require all that much discussion. You could either choose to
only display the array of deleted index tuple page offset numbers, as
well as the similar array of "updated" index tuple page offset numbers
from xl_btree_delete, in which case you just display two arrays of
page offset numbers, in the same standard way. You may or may not want
to also show each individual xl_btree_update entry -- doing so would
be kinda like showing the details of individual freeze plans, except
that you'd probably display something very similar to the page offset
number display here too (even though these aren't page offset numbers,
they're 0-based offsets into the posting list's item pointer data
array).

BTW, there is also a tendency for non-btree index AM WAL records to be
fairly similar or even near-identical to the B-Tree WAL records. While
Hash indexes are very different to B-Tree indexes at a high level, it
is nevertheless the case that xl_hash_vacuum_one_page is directly
based on xl_btree_delete/xl_btree_vacuum, and that xl_hash_insert is
directly based on xl_btree_insert. There are some other WAL record
types that are completely different across hash and B-Tree, which is a
reflection of the fact that the index grows using a totally different
approach in each AM -- but that doesn't seem like something that
throws up any roadblocks for you (these can all be displayed as simple
structs anyway).

Speaking with my B-Tree hat on, I'd just be happy to be able to see
both of the page offset number arrays (the deleted + updated offset
number arrays from xl_btree_delete/xl_btree_vacuum), without also
being able to\ see output for each individual xl_btree_update
item-pointer-array-offset arrays -- just seeing that much is already a
huge improvement. That's why I'm a bit hesitant to use the term API
just yet, because an obligation to be consistent in whatever way seems
like it might block incremental progress.

> Perhaps there should also be example output of the offset arrays in
> pgwalinspect docs?

That would definitely make sense.

> I've changed the array format helper functions that Andres added to be a
> single function with an additional layer of indirection so that any
> record with an array can use it regardless of type and format of the
> individual elements. The signature is based somewhat off of qsort_r()
> and allows the user to pass a function with the the desired format of
> the elements.

That's handy.

> Personally, I like having the infomasks for the freeze plans. If we
> someday have a more structured input to rmgr_desc, we could then easily
> have them in their own column and use functions like
> heap_tuple_infomask_flags() on them.

I agree, in general, though long term the best approach is one that
has a configurable level of verbosity, with some kind of roughly
uniform definition of verbosity (kinda like DEBUG1 - DEBUG5, though
probably with only 2 or 3 distinct levels).

Obviously what you're doing here will lead to a significant increase
in the verbosity of the output for affected WAL records. I don't feel
too bad about that, though. It's really an existing problem, and one
that should be fixed either way. You kind of have to deal with this
already, by having a good psql pager, since record types such as
COMMIT_PREPARED, INVALIDATIONS, and RUNNING_XACTS are already very
verbose in roughly the same way. You only need to have one of these
record types output by a function like pg_get_wal_records_info() to
get absurdly wide output -- it hardly matters that most individual WAL
record types have terse output at that point.

> > > How hard would it be to invent a general mechanism to control the verbosity
> > > of what we'll show for each WAL record?
> >
> > Nontrivial, I'm afraid. We don't pass any relevant parameters to rm_desc:
> > void (*rm_desc) (StringInfo buf, XLogReaderState *record);
> >
> > so we'd need to patch all of them. That might be worth doing at some point,
> > but I don't want to tackle it right now.
>
> In terms of a more structured format, it seems like it would make the
> most sense to pass a JSON or composite datatype structure to rm_desc
> instead of that StringInfo.
>
> I would also like to see functions like XLogRecGetBlockRefInfo() pass
> something more useful than a stringinfo buffer so that we could easily
> extract out the relfilenode in pgwalinspect.

That does seem particularly important. It's a pain to do this from
SQL. In general I'm okay with focussing on pg_walinspect over
pg_waldump, since it'll become more important over time. Obviously
pg_waldump needs to still work, but I think it's okay to care less
about pg_waldump usability.

> > BTW, while playing around with this patch today, I noticed that it
> > won't display the number of elements in each offset array directly.
> > Perhaps it's worth including that, too?
>
> I believe I have addressed this in the attached patch.

Thanks for taking care of that.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2023-01-31 21:56:31 Re: heapgettup() with NoMovementScanDirection unused in core?
Previous Message Thomas Munro 2023-01-31 21:20:23 Re: pg_upgrade test failure