Re: BUG #17928: Standby fails to decode WAL on termination of primary

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, exclusion(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17928: Standby fails to decode WAL on termination of primary
Date: 2023-08-15 02:36:10
Message-ID: CA+hUKGLzv=8g3MFwm+TGFu+7koypgmU7qKHni+nrEnuJ8P6eog@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Tue, Aug 15, 2023 at 2:05 PM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> /*
> * Try to find space to decode this record, if we can do so without
> * calling palloc. If we can't, we'll try again below after we've
> * validated that total_len isn't garbage bytes from a recycled WAL page.
> */
> decoded = XLogReadRecordAlloc(state,
> total_len,
> false /* allow_oversized */ );
>
> The patch relies on total_len before the header validation is
> completed, meaning that this value of total_len could be invalid
> because it comes from the partially-read header.. Oh wait, that's
> actually OK because an oversized allocation is not allowed yet and the
> decode buffer would not be able to store more than what it can?
> Perhaps this comment should be updated to tell something like, adding
> that total_len can be garbage, but we don't care because we don't
> allocate anything that the decode buffer cannot hold.

Yeah.

> #ifndef FRONTEND
>
> - /*
> - * Note that in much unlucky circumstances, the random data read from a
> - * recycled segment can cause this routine to be called with a size
> - * causing a hard failure at allocation. For a standby, this would cause
> - * the instance to stop suddenly with a hard failure, preventing it to
> - * retry fetching WAL from one of its sources which could allow it to move
> - * on with replay without a manual restart. If the data comes from a past
> - * recycled segment and is still valid, then the allocation may succeed
> - * but record checks are going to fail so this would be short-lived. If
> - * the allocation fails because of a memory shortage, then this is not a
> - * hard failure either per the guarantee given by MCXT_ALLOC_NO_OOM.
> - */
> if (!AllocSizeIsValid(newSize))
> return false;
>
> Wouldn't it be OK to drop entirely this check? I'm fine to keep it as
> a safety measure, but it should not be necessary now that it gets
> called only once the header is validated?

Yeah, now we're in "shouldn't happen" territory, and I'm not sure.

> Regarding the tests, I have been using this formula to produce the
> number of bytes until the next page boundary:
> select setting::int - ((pg_current_wal_lsn() - '0/0') % setting::int)
> from pg_settings where name = 'wal_block_size';
>
> Using pg_logical_emit_message() non-transactional with an empty set of
> strings generates records of 56 bytes, so I was thinking about the
> following:
> - Generate a bunch of small records with pg_logical_emit_message(), or
> records based on the threshold with the previous formula.
> - Loop until we reach a page limit, at 24 bytes (?).
> - Generate one last record to cut through.
> - Stop the node in immediate mode.
> - Write some garbage bytes on the last page generated to emulate the
> recycled contents and an allocation
> - Start the node, which should be able to startup.
> With wal_level = minimal, autovacuum = off and a large
> checkpoint_timeout, any other records are minimized. That's fancy,
> though.
>
> Do you think that this could work reliably?

Yeah, I think that sounds quite promising, and funnily enough I was
just now working on some Perl code that appends controlled junk to the
WAL in a test like that so we can try to hit all the error paths...

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Michael Paquier 2023-08-15 03:00:30 Re: BUG #17928: Standby fails to decode WAL on termination of primary
Previous Message Michael Paquier 2023-08-15 02:04:49 Re: BUG #17928: Standby fails to decode WAL on termination of primary