Re: BUG #17928: Standby fails to decode WAL on termination of primary

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, exclusion(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17928: Standby fails to decode WAL on termination of primary
Date: 2023-09-02 01:29:39
Message-ID: ZPKQA+HHqLVRFBSs@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Mon, Aug 21, 2023 at 08:32:39AM +0900, Michael Paquier wrote:
> I am not sure that I'll be able to do more on this topic this week, at
> least that's some progress.

Some time later..

I have spent more time and thoughts on the whole test suite, finishing
with the attached 0003 that applies on top of your own patches. I am
really looking forward to making this whole logic more robust, so as
WAL replay can be made itself more robust for the OOM/end-of-wal
detection on HEAD for standbys and crash recovery.

While looking at the whole, I have considered a few things that may
make the test cleaner, like:
- Calculating the segment name and its offset from the end_lsn of a
record directly from the backend, but it felt inelegant to pass
through more subroutine layers the couple ($segment, offset) rather
than just a LSN, so guessing the segment number and the offset while
the cluster is offline if OK by me.
- The TLI can be queried from the server rather than hardcoded.
- I've been thinking about bundling the tests of each sub-section in
their own subroutine, but that felt a bit awkward, particularly for
the part where we need a correct $prev_lsn in the record header
written to enforce other code paths.
- The test needs better documentation. One of the things I kept
staring at is cross-checking pack() and the dependency to the C
structures, so I have added more details there, explaining more the
whys and the hows.

I have also looked again at the C code for a few hours, and still got
the impression that this is rather solid. There are two things that
may be better:
- Document at the top of allocate_recordbuf() that this should never
be called with a length coming from a header until it is validated.
- Removing AllocSizeIsValid() for the non-FRONTEND path should be OK.

What do you think?
--
Michael

Attachment Content-Type Size
v3-0001-Add-TAP-tests-for-end-of-WAL-conditions.patch text/x-diff 12.8 KB
v3-0002-Don-t-trust-unvalidated-xl_tot_len.patch text/x-diff 7.8 KB
v3-0003-Adjust-few-things-on-top-of-Thomas-patch-series.patch text/x-diff 22.5 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Sergei Kornilov 2023-09-02 11:51:48 Re: BUG #17928: Standby fails to decode WAL on termination of primary
Previous Message Michael Paquier 2023-09-02 00:08:22 Re: BUG #17973: Reinit of pgstats entry for dropped DB can break autovacuum daemon