Re: pg_waldump: support decoding of WAL inside tarfile

From: Amul Sul <sulamul(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: pg_waldump: support decoding of WAL inside tarfile
Date: 2026-01-19 11:13:30
Message-ID: CAAJ_b97VUiP-DbLNe-ddq64J_RiB4ZcPgAjHkJH-0dbzgjR++A@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Jan 17, 2026 at 2:08 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Fri, Jan 2, 2026 at 7:30 AM Amul Sul <sulamul(at)gmail(dot)com> wrote:
> > Attached is the rebased version, includes some code comment improvements.
>

Thanks for the feedback. I am replying to a few comments inline below
regarding elements that are unlikely to change -- explaining the
reasoning why I believe they are correct -- just to ensure we are
aligned. For the remaining points, I generally agree and will either
implement the improvements or provide my reasoning if I don't.

>
> + if (len > 0 && recptr > startPtr)
> + {
> + int skipBytes = 0;
> +
> + /*
> + * The required offset is not at the start of
> the buffer, so skip
> + * bytes until reaching the desired offset of
> the target page.
> + */
> + skipBytes = recptr - startPtr;
> +
> + buf += skipBytes;
> + len -= skipBytes;
> + }
>
> This logic seems pretty confusing. It looks as though len can go
> negative, because what if skipBytes > len? I'm not sure that's really
> possible here, but there's no comments explaining why it can't and no
> assertion verifying that it doesn't. I think you should probably try
> to unify this with the following if statement that is gated by (len >
> 0).

The preceding check if (recptr >= endPtr) ensures that within this
block, recptr is always less than endPtr. Since startPtr is derived as
endPtr - len, recptr naturally falls within the expected range (i.e.,
recptr - startPtr <= len). However, I agree that we can simplify this
by combining it with the subsequent if statement and/or by adding an
assert there.

>
> + entry = privateInfo->cur_wal;
> +
> + /* Fetch more data */
> + if (read_archive_file(privateInfo, READ_CHUNK_SIZE) == 0)
> + break; /* archive file ended */
>
> Why do we set entry to privateInfo->cur_wal and then call
> read_archive_file() afterwards? Doesn't read_archive_file() have the
> potential to change cur_wal? If so, don't we want to look at the entry
> that is current after we've called read_archive_file(), rather than
> before? Also, to harp on the naming a bit more, stuff like entry =
> privateInfo->cur_wal kind of shows that you haven't made the naming
> real consistent. This could be more clear if it said something like
> wal_file = privateInfo->cur_wal_file.

I see your point, and you’re right that assigning the entry to
privateInfo->cur_wal here seems useless; however, it becomes necessary
for the logic in patch 0005. I assigned it just before
read_archive_file() because that function can change where the current
WAL ends and a new one starts. I need to track the completed WAL entry
at that moment, as it may contain data that needs to be spilled to a
temporary file. I'll add a comment to clarify this. I also agree with
your feedback on variable naming and will improve that in the next
revision.

>
> + /* Found the required entry */
> + if (entry->segno == segno)
> + return entry;
> +
> + /*
> + * Ignore if the timeline is different or the current
> segment is not
> + * the desired one.
> + */
> + if (privateInfo->timeline != entry->timeline ||
> + privateInfo->startSegNo > entry->segno ||
> + privateInfo->endSegNo < entry->segno)
> + {
> + privateInfo->cur_wal = NULL;
> + continue;
> + }
> +
>
> I don't really understand what's happening here. One thing that's odd
> is that the first of these two if statements considers the timeline
> and the second does not. It seems unlikely that this is correct. I
> suspect that the hash table needs to be keyed by (TLI, segno) rather
> than just by TLI, and that get_archive_wal_entry needs to take both a
> TLI and a segment number as an argument. Otherwise, what could
> possibly prevent the first if statement from returning the correct
> segment from the wrong timeline?

Yeah, I need to swap the positions of these checks.

>
> Another thing that's odd is: why is this function resetting
> privateInfo->cur_wal to NULL? I feel like it should be the job of
> astreamer_waldump_content() to manage the value of
> privateInfo->cur_wal, and this function doesn't seem to have any
> business touching it. Even if it thinks that we want to ignore that
> WAL file for purposes of *this* call to get_archive_wal_entry(),
> surely it has no right to decide that the *next* call to this function
> should ignore that entry as well.
>
> Honestly, I don't really understand why we need to poke into
> privateInfo at all here. I feel like the shape of this loop should be:
> while (1) { if (read_archive_file(...) == 0) break; if (we have the
> correct entry) return entry; } }. It seems to me that it ought to be
> the job of the archive streamer to handle everything else, including
> buffering more data and spilling to files. This should just iterate,
> repeatedly calling read_archive_file(), until we either get to the
> file we want or run out of archive.
>

We previously tried having the astreamer function perform all
necessary checks (e.g., TLI and segment number), but this became
problematic in init_archive_reader when we needed to read a single WAL
page just to determine the WAL segment size without any validation. To
bypass those checks, I initially used a READ_ANY_WAL() macro, but it
felt like a hack.

In the current design, the archive streamer is responsible only for
reading and copying data to the buffer. To optimize this and avoid
unnecessary memcpy and data buffering for WAL files that aren't
needed, we set privateInfo->cur_wal to NULL, signaling the streamer
code to skip those operations.

But, I am thinking that instead of setting privateInfo->cur_wal to
NULL, we could simply discard the buffer data at that place and let
memcpy in astreamer_content copy if it would be that much of an issue.

> + /*
> + * XXX: If the segment being read not the requested
> one, the data must
> + * be buffered, as we currently lack the mechanism to
> write it to a
> + * temporary file. This is a known limitation that
> will be fixed in the
> + * next patch, as the buffer could grow up to the full
> WAL segment
> + * size.
> + */
> + if (segno > entry->segno)
> + continue;
>
> I couldn't figure out what was going on here for a while. Then I
> realized that this makes some sense: if (segno > entry->segno), that
> means that the segment number that we've found in the archive is older
> than the requested segment number. Because we don't back up, that
> means we don't need the data. I think the idea is that we would fall
> through here and buffer the data if this if-test fails, but patch 0005
> actually removes these two lines of code, so I think something here is
> kind of messed up. I think the result will be that with both patches
> applied, we'll buffer data we don't actually need for anything.
>

Yeah, should discard buffered data too.

> Generally, I think that some details here should be revisited:
>
> +typedef struct ArchivedWALEntry
> +{
> + uint32 status; /* hash status */
> + XLogSegNo segno; /* hash key: WAL
> segment number */
> + TimeLineID timeline; /* timeline of this wal file */
> +
> + StringInfoData buf;
> + bool tmpseg_exists; /* spill file exists? */
> +
> + int total_read; /* total read
> of archived WAL segment */
> +} ArchivedWALEntry;
>
> The first few members are great, except that timeline needs to be part
> of the hash key as well, so you probably need to make a two-element
> struct with a segno and a TLI and make that struct be the thing that
> comes right after status. After that, I'm not so sure. Essentially, I
> think what's happening here is that segno*file_size+total_read is the
> end LSN for buf, and the start LSN value is that value minus buf.len.
> And if tmpseg_exists is true, then the buffer might be empty, and you
> can read however much of the segment we found from the temp file. I
> find that confusing. If we're going to do something like this, I feel
> like it would make more sense to store XLogRecPtr starttli instead of
> int total_read, so that we explicitly mention the first LSN stored in
> the buffer, and you infer the ending LSN from the length of the
> buffer.
>

total_read is not only the total of buffered data; it also includes
data that has been spilled to a temporary file. If the buffer is not
empty, the most recently read data from the archive exists there;
otherwise, if a temporary file exists, that data will be found there.
In short, it tracks the total bytes of WAL that have been read from
the archive. I will improve the code comment to reflect this.

> But, stepping back a bit, why do we have a StringInfoData in every
> ArchivedWALEntry? It makes sense to do it that way if we're going to
> buffer all of the WAL data in memory. In that case, you need a
> separate buffer for every ArchivedWALEntry, and so this is logical.
> But if we're going to spill to disk, then there should only ever be
> one buffer in use at any given time, so why have a buffer in every
> ArchivedWALEntry instead of a single buffer inside of XLogDumpPrivate?
> We could choose to treat the hash table as the set of previous
> segments for which we have written data to disk, and the current
> segment and any associated buffer would be tracked inside the
> XLogDumpPrivate or astreamer_waldump, which would mean that only one
> copy of them would exist. I'm tempted to say that the direction I'm
> sketching here is the right way to go, but I'm not 100% certain that's
> true. It could be write to have a buffer per WAL file, as you have
> here, but if so, we should know why we're doing that and how it fits
> into the design, and I'm not currently seeing a real reason for it.
>

In the earlier version, we had only one buffer where we copied
streamed data, but a problem (previously discussed [1]) arose when the
archive streamer reads data from two WAL files -- i.e., one is ending
and the next is starting. We have to keep track of where the previous
file ends and the next one starts, along with the WAL segment
information that each belongs to.

> + if (strstr(member->pathname, "PaxHeaders."))
> + return false;
>
> There is no way that a filename containing the string "PaxHeaders."
> could ever pass the IsXLogFileName test just above. We shouldn't need
> this.
>

This checks the directory name (e.g.,
x_dir/y_dir/PaxHeaders.NNNN/wal_file_name). The name of that metadata
file is exactly the same as the WAL file name, which is why
IsXLogFileName() doesn't help here.

Regards,
Amul

1] http://postgr.es/m/CA+Tgmoardk4VuthHc23vov+AVkhq7eT0mFUs-2ctAnP1uiTaog@mail.gmail.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fujii Masao 2026-01-19 11:22:04 Re: file_fdw: Support multi-line HEADER option.
Previous Message Hayato Kuroda (Fujitsu) 2026-01-19 10:55:00 RE: Simplify code building the LR conflict messages