Re: Fail hard if xlogreader.c fails on out-of-memory

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Fail hard if xlogreader.c fails on out-of-memory
Date: 2023-09-26 22:06:37
Message-ID: CA+hUKG+-YUfpnf3sT1GxNg6+smBf2NKWBFw-E+14T55U47aNew@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Sep 26, 2023 at 8:38 PM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> Thoughts and/or comments are welcome.

I don't have an opinion yet on your other thread about making this
stuff configurable for replicas, but for the simple crash recovery
case shown here, hard failure makes sense to me.

Here are some interesting points in the history of this topic:

1999 30659d43: xl_len is 16 bit, fixed size buffer in later commits
2001 7d4d5c00: WAL files recycled, xlp_pageaddr added
2004 0ffe11ab: xl_len is 32 bit, dynamic buffer, malloc() failure ends redo
2005 21fda22e: xl_tot_len and xl_len co-exist
2014 2c03216d: xl_tot_len fully replaces xl_len
2018 70b4f82a: xl_tot_len > 1GB ends redo
2023 8fcb32db: don't let xl_tot_len > 1GB be logged!
2023 bae868ca: check next xlp_pageaddr, xlp_rem_len before allocating

Recycled pages can't fool us into making a huge allocation any more.
If xl_tot_len implies more than one page but the next page's
xlp_pageaddr is too low, then either the xl_tot_len you read was
recycled garbage bits, or it was legitimate but the overwrite of the
following page didn't make it to disk; either way, we don't have a
record, so we have an end-of-wal condition. The xlp_rem_len check
defends against the second page making it to disk while the first one
still contains recycled garbage where the xl_tot_len should be*.

What Michael wants to do now is remove the 2004-era assumption that
malloc failure implies bogus data. It must be pretty unlikely in a 64
bit world with overcommitted virtual memory, but a legitimate
xl_tot_len can falsely end recovery and lose data, as reported from a
production case analysed by his colleagues. In other words, we can
actually distinguish between lack of resources and recycled bogus
data, so why treat them the same?

For comparison, if you run out of disk space during recovery we don't
say "oh well, that's enough redoing for today, the computer is full,
let's forget about the rest of the WAL and start accepting new
transactions!". The machine running recovery has certain resource
requirements relative to the machine that generated the WAL, and if
they're not met it just can't do it. It's the same if it various
other allocations fail. The new situation is that we are now
verifying that xl_tot_len was actually logged by PostgreSQL, so if we
can't allocate space for it, we can't replay the WAL.

*A more detailed analysis would talk about sectors (page header is
atomic), and consider whether we're only trying to defend ourselves
against recycled pages written by PostgreSQL (yes), arbitrary random
data (no, but it's probably still pretty good) or someone trying to
trick us (no, and we don't stand a chance).

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2023-09-26 22:23:18 Annoying build warnings from latest Apple toolchain
Previous Message Tom Lane 2023-09-26 21:50:29 Re: dikkop seems unhappy because of openssl stuff (FreeBSD 14-BETA1)