Re: WIP: WAL prefetch (another approach)

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, David Steele <david(at)pgmasters(dot)net>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Jakub Wartak <Jakub(dot)Wartak(at)tomtom(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: WAL prefetch (another approach)
Date: 2021-05-05 01:08:35
Message-ID: 20210505010835.umylslxgq4a6rbwg@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2021-05-04 15:47:41 -0400, Tom Lane wrote:
> BTW, that conclusion shouldn't distract us from the very real bug
> that Andres identified. I was just scraping the buildfarm logs
> concerning recent failures, and I found several recent cases
> that match the symptom he reported:
> [...]
> They all show the standby in recovery/019_replslot_limit.pl failing
> with symptoms like
>
> 2021-05-04 07:42:00.968 UTC [24707406:1] LOG: database system was shut down in recovery at 2021-05-04 07:41:39 UTC
> 2021-05-04 07:42:00.968 UTC [24707406:2] LOG: entering standby mode
> 2021-05-04 07:42:01.050 UTC [24707406:3] LOG: redo starts at 0/1C000D8
> 2021-05-04 07:42:01.079 UTC [24707406:4] LOG: consistent recovery state reached at 0/1D00000
> 2021-05-04 07:42:01.079 UTC [24707406:5] FATAL: invalid memory alloc request size 1476397045
> 2021-05-04 07:42:01.080 UTC [13238274:3] LOG: database system is ready to accept read only connections
> 2021-05-04 07:42:01.082 UTC [13238274:4] LOG: startup process (PID 24707406) exited with exit code 1

Yea, that's the pre-existing end-of-log issue that got more likely as
well as more consequential (by accident) in Thomas' patch. It's easy to
reach parity with the state in 13; it's just a matter of changing the
order in one place.

But I think we need to do something for all branches here. The bandaid
that was added to allocate_recordbuf() doesn't really seem sufficient
to me. This is

commit 70b4f82a4b5cab5fc12ff876235835053e407155
Author: Michael Paquier <michael(at)paquier(dot)xyz>
Date: 2018-06-18 10:43:27 +0900

Prevent hard failures of standbys caused by recycled WAL segments

In <= 13 the current state is that we'll allocate a buffer of an
effectively random size, as long as that random number is below 1GB,
whenever we reach the end of the WAL with the record header split
across a page boundary (because there we don't validate the header
before allocating). That allocation is then not freed for the lifetime
of the xlogreader. And for FRONTEND uses of xlogreader we'll just
happily allocate up to 4GB. The specific problem here is that we don't
validate the record header before allocating when the record header is
split across a page boundary - without much need as far as I can tell?
Until we've read the entire header, we don't actually need to allocate
the record buffer?
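
To illustrate what I mean (just a sketch in plain C, not the actual
xlogreader.c code - the struct, MAX_RECORD_LEN and the helpers are made
up for the example): assemble the complete header first, even when it
straddles the page boundary, validate it, and only then size the
allocation from xl_tot_len.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_RECORD_LEN	(1024u * 1024 * 1024)	/* stand-in for the 1GB cap */

typedef struct
{
	uint32_t	xl_tot_len;		/* total record length, header included */
	uint8_t		xl_info;
	uint8_t		xl_rmid;
} FakeRecordHeader;

/* Pretend source of WAL bytes: copy whatever is available on a page. */
static size_t
read_header_bytes(const char *src, size_t srclen, char *dst, size_t want)
{
	size_t		n = (want < srclen) ? want : srclen;

	memcpy(dst, src, n);
	return n;
}

static char *
read_record(const char *page1, size_t len1, const char *page2, size_t len2)
{
	FakeRecordHeader hdr;
	char	   *dst = (char *) &hdr;
	size_t		have;

	/* The header may stop short at the end of the first page... */
	have = read_header_bytes(page1, len1, dst, sizeof(hdr));
	/* ...in which case we continue it on the following page. */
	if (have < sizeof(hdr))
		have += read_header_bytes(page2, len2, dst + have, sizeof(hdr) - have);
	if (have < sizeof(hdr))
		return NULL;			/* genuine end of WAL, nothing to allocate */

	/*
	 * Validate before allocating: a garbage xl_tot_len read from a
	 * half-written or recycled page must never reach the allocator.
	 */
	if (hdr.xl_tot_len < sizeof(hdr) || hdr.xl_tot_len > MAX_RECORD_LEN)
		return NULL;			/* treat as end of WAL, not a hard failure */

	return malloc(hdr.xl_tot_len);
}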

This seems like an issue that needs to be fixed to make us more robust
in crash recovery scenarios, where obviously we could have crashed
leaving half-written records behind.

But the issue that 70b4f82a4b is trying to address seems bigger to
me. The reason it's so easy to hit the issue is that walreceiver does <
8KB writes into recycled WAL segments *without* zero-filling the tail
end of the page - which will commonly be filled with random older
contents, because we use recycled segments. I think that *drastically*
increases the likelihood of finding something that looks like a valid
record header, compared to the situation on a primary, where zeroing
pages before use makes that pretty unlikely.

> (BTW, the behavior seen here where the failure occurs *immediately*
> after reporting "consistent recovery state reached" is seen in the
> other reports as well, including Andres' version. I wonder if that
> means anything.)

That's to be expected, I think. There's not a lot of data that needs to
be replayed, and we'll always reach consistency before the end of the
WAL, unless you're dealing with starting from a base backup that hasn't
yet finished or such. The test causes replication to fail shortly after
that, so we'll always switch to doing recovery from pg_wal, which then
will hit the end of the WAL, hitting this issue with, I think, ~25%
likelihood (data from recycled WAL segments is probably *roughly*
evenly distributed, and any 4-byte value above 1GB will hit this error
in 14).

Greetings,

Andres Freund
