XLogReadBufferExtended() vs disconnected segments

From: Andres Freund <andres(at)anarazel(dot)de>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: vignesh C <vignesh21(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: XLogReadBufferExtended() vs disconnected segments
Date: 2023-02-23 01:01:47
Message-ID: 20230223010147.32oir7sb66slqnjk@awork3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

I was trying to implement ExtendRelationBufferedTo(), responding to a review
comment by Heikki, in
https://www.postgresql.org/message-id/20230222203152.rh4s75aedj65hyjn@awork3.anarazel.de

Which lead me to stare at the P_NEW do while loop in
XLogReadBufferExtended(). I first started to reply on that thread, but it
seems like a big enough issue that it seemed worth starting a separate thread.

The relevant logic was added in 6f2aead1ffec, the relevant discussion is at
https://www.postgresql.org/message-id/32313.1392231107%40sss.pgh.pa.us

My understanding of what happend there is that we tried to extend a relation,
sized one block below a segment boundary, and after that the relation was much
larger, because the next segment file existed, and had a non-zero size. And
because we extended blkno-lastblock times, we'd potentially blow up the
relation size much more than intended.

The actual cause of that in the reported case appears to have been a bug in
wal-e. But I suspect it's possible to hit something like that without such
problems, just due to crashes on the replica, or "skew" while taking a base
backup.

I find it pretty sketchy that we just leave the contents of the previously
"disconnected" segment contents around, without using log_invalid_page() for
the range, or warning, or ...

Most of the time this issue would be fixed due to later WAL replay
initializing the later segment. But I don't think there's a guarantee for
that (see below).

It'd be one thing if we accidentally used data in such a segment, if the
situation is only caused by a bug in base backup tooling, or filesystem
corruption, or ...

But I think we can encounter the issue without anything like that being
involved. Imagine this scenario:

1) filenode A gets extended to segment 3
2) basebackup starts, including performing a checkpoint
3) basebackup ends up copying A's segment 3 first, while in progress
4) filenode A is dropped
5) checkpoint happens, allowing smgrrel 10 to be used again
6) filenode 10 is created newly
7) basebackup ends

At that point A will have segment 0, segment 3. The WAL replay for 4) won't
drop segment 3, because an smgrnblocks() won't even see it, because segment 2
doesn't exist.

If a replica starts from this base backup, we'll be fine until A again grows
far enough to fill segment 2. At that point, we'll suddenly have completely
bogus contents in 3. Obviously accesses to those contents could trivially
crash at that point.

I suspect there's an easier to hit version of this: Consider this path in
ExecuteTruncateGuts():

/*
* Normally, we need a transaction-safe truncation here. However, if
* the table was either created in the current (sub)transaction or has
* a new relfilenumber in the current (sub)transaction, then we can
* just truncate it in-place, because a rollback would cause the whole
* table or the current physical file to be thrown away anyway.
*/
if (rel->rd_createSubid == mySubid ||
rel->rd_newRelfilelocatorSubid == mySubid)
{
/* Immediate, non-rollbackable truncation is OK */
heap_truncate_one_rel(rel);

Afaict that could easily lead to a version of the above that doesn't even
require relfilenodes getting recycled.

One way to to defend against this would be to make mdextend(), whenever it
extends into the last block of a segment, unlink the next segment - it can't
be a validly existing contents. But it seems scary to just unlink entire
segments.

Greetings,

Andres Freund

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2023-02-23 01:12:28 Re: XLogReadBufferExtended() vs disconnected segments
Previous Message Michael Paquier 2023-02-23 00:43:24 Re: meson and sslfiles.mk in src/test/ssl/