Re: BUG #17928: Standby fails to decode WAL on termination of primary

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, exclusion(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17928: Standby fails to decode WAL on termination of primary
Date: 2023-08-15 06:11:16
Message-ID: ZNsXBFsFsKcCbP0q@paquier.xyz
Lists: pgsql-bugs

On Tue, Aug 15, 2023 at 12:00:30PM +0900, Michael Paquier wrote:
> Not sure if that will help, but what I was playing with some stuff in
> the lines of:
> -- Store the length up to page boundary.
> select setting::int - ((pg_current_wal_insert_lsn() - '0/0') %
> setting::int) as boundary from pg_settings where name = 'wal_block_size'
> \gset
> -- Generate record up to boundary (56 bytes for base size of the record,
> -- stop at 12 bytes before the end of the page.
> select pg_logical_emit_message(false, '', repeat('a', :boundary - 56 - 12));
>
> Then by injecting some FF's on the last page written and forcing
> replay I am able to force some of the error code paths, so I guess
> that's what you were basically doing?
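
For readers following along, the boundary arithmetic in the quoted SQL can be sketched outside the server. This is only an illustration in Python, not part of the test: the function names are hypothetical, the 8192-byte default for wal_block_size is an assumption, and the 56 and 12 byte figures are simply the ones quoted above.

```python
# Sketch of the page-boundary arithmetic from the quoted SQL.
# Assumption: wal_block_size is 8192 bytes (the usual default).
WAL_BLOCK_SIZE = 8192


def parse_lsn(text: str) -> int:
    """Convert an LSN written as 'X/Y' (two hex halves) into a byte position."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def bytes_to_page_boundary(lsn: int, block_size: int = WAL_BLOCK_SIZE) -> int:
    """Bytes left between lsn and the end of its WAL page.

    Mirrors: setting::int - ((lsn - '0/0') % setting::int)
    """
    return block_size - (lsn % block_size)


def message_payload_length(lsn: int, block_size: int = WAL_BLOCK_SIZE,
                           record_overhead: int = 56, tail_gap: int = 12) -> int:
    """Payload length that makes the record stop tail_gap bytes before the page end.

    Mirrors: repeat('a', :boundary - 56 - 12)
    """
    length = bytes_to_page_boundary(lsn, block_size) - record_overhead - tail_gap
    if length < 0:
        raise ValueError("not enough room left on the page for this record")
    return length


if __name__ == "__main__":
    lsn = parse_lsn("0/14EA428")  # an arbitrary example LSN
    print(bytes_to_page_boundary(lsn))
    print(message_payload_length(lsn))
```

An LSN that sits exactly on a page boundary yields a full block here, the same as the SQL expression does.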

I've been spending some extra time on this one and hacked a TAP test
that reliably reproduces the original issue, using a message similar
to what I mentioned in my previous messages. I guess that we could
use something like that:
2023-08-15 15:07:03.790 JST [8729] LOG: redo starts at 0/14EA428
2023-08-15 15:07:03.790 JST [8729] FATAL: invalid memory alloc request size 4294969740
2023-08-15 15:07:03.791 JST [8726] LOG: startup process (PID 8729) exited with exit code 1
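
As a side note on the FATAL line, the requested size can be compared against PostgreSQL's palloc limit, MaxAllocSize (0x3fffffff, one byte short of 1 GB). The snippet below is only a sketch of that comparison in Python; the function name is hypothetical and this is not the server code itself.

```python
# Sketch: compare the size from the FATAL message with the palloc limit.
MAX_ALLOC_SIZE = 0x3FFFFFFF  # PostgreSQL's MaxAllocSize, 1 GB - 1


def alloc_size_is_valid(size: int) -> bool:
    """True if a request of this many bytes fits under the palloc limit."""
    return 0 <= size <= MAX_ALLOC_SIZE


if __name__ == "__main__":
    requested = 4294969740  # size reported in the log above
    print(alloc_size_is_valid(requested))
```

The reported size is far above that limit, which is consistent with the startup process bailing out instead of attempting the allocation.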

The proposed patches pass the test, HEAD does not. We may want to do
more with page boundaries, and more error patterns, but the idea looks
worth exploring more. At least this can be used to validate patches.

I've noticed while hacking the test that we don't do an XLogFlush()
after inserting the message's record, so we may lose it on crash.
That makes the test unstable except if an extra record is added after
the logical messages. The attached patch forces that for the sake of
the test, but I'm spawning a different thread as losing this data
looks like a bug to me.
--
Michael

Attachment Content-Type Size
0001-Add-test-to-emulate-random-garbage-data-during-WAL-r.patch text/x-diff 4.1 KB
