base backup vs. concurrent truncation

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: base backup vs. concurrent truncation
Date: 2023-04-21 13:42:57
Message-ID: CA+TgmoZFUMH8ghewsjEkd0Ntbkfa6p4eBv-bVP6z9-GtQw13tA@mail.gmail.com
Lists: pgsql-hackers

Hi,

Apologies if this has already been discussed someplace, but I couldn't
find a previous discussion. It seems to me that base backups are
broken in the face of a concurrent truncation that reduces the number
of segments in a relation.

Suppose we have a relation that is 1.5GB in size, so that we have two
files: 23456, which is 1GB, and 23456.1, which is 0.5GB. We'll back
those files up in whichever order the directory scan finds them.
Suppose we back up 23456.1 first. Then the relation is truncated to
0.5GB, so 23456.1 is removed and 23456 gets a lot shorter. Next we
back up the file 23456. Now our backup contains files 23456 and
23456.1, each 0.5GB. But this breaks the invariant in md.c:

* On disk, a relation must consist of consecutively numbered segment
* files in the pattern
*    -- Zero or more full segments of exactly RELSEG_SIZE blocks each
*    -- Exactly one partial segment of size 0 <= size < RELSEG_SIZE blocks
*    -- Optionally, any number of inactive segments of size 0 blocks.
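
To make the end state concrete, the backup would now contain something
like this (database OID, file mode, and timestamps invented for
illustration):

$ ls -l base/16384/23456*
-rw------- 1 pg pg 536870912 ... 23456      <- 0.5GB, the "partial" segment
-rw------- 1 pg pg 536870912 ... 23456.1    <- 0.5GB, neither full nor absent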

basebackup.c's theory about relation truncation is that it doesn't
really matter because WAL replay will fix things up. But in this case,
I don't think it will, because WAL replay relies on the above
invariant holding. As mdnblocks says:

/*
* If segment is exactly RELSEG_SIZE, advance to next one.
*/
segno++;
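
For context, the surrounding loop in mdnblocks() is shaped roughly like
this (a simplified sketch, not the verbatim source):

for (;;)
{
    nblocks = _mdnblocks(reln, forknum, v);
    if (nblocks < ((BlockNumber) RELSEG_SIZE))
        return (segno * ((BlockNumber) RELSEG_SIZE)) + nblocks;

    /*
     * If segment is exactly RELSEG_SIZE, advance to next one.
     */
    segno++;

    v = _mdfd_openseg(reln, forknum, segno, 0);
    if (v == NULL)
        return segno * ((BlockNumber) RELSEG_SIZE);  /* no next segment */
}

Since the restored 23456 is only 0.5GB, the first branch returns before
segno ever advances, so segment 1 is never even opened.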

So I think what's going to happen is we're not going to notice 23456.1
when we recover the backup. It will just sit there as an orphaned file
forever, unless we extend 23456 back to a full 1GB, at which point we
might abruptly start considering that file part of the relation again.

Assuming I'm not wrong about all of this, the question arises: whose
fault is this, and what to do about it? It seems to me that it's a bit
hard to blame basebackup.c, because if you used pg_backup_start() and
pg_backup_stop() and copied the directory yourself, you'd have exactly
the same situation, and while we could (and perhaps should) teach
basebackup.c to do something smarter, it doesn't seem realistic to
impose complex constraints on the user's choice of file copy tool.
Furthermore, I think that the problem could arise without performing a
backup at all: say that the server crashes at the OS level
mid-truncation, and the truncation of segment 0 reaches disk but the
removal of segment 1 does not.

So I think the problem is with md.c assuming that its invariant must
hold on a cluster that's not guaranteed to be in a consistent state.
But mdnblocks() clearly can't try to open every segment up to whatever
the maximum theoretical possible segment number is every time it's
invoked, because that would be wicked expensive. An idea that occurs
to me is to remove all segment files following the first partial
segment during startup, before we begin WAL replay. If that state
occurs at startup, then either we have a scenario involving
truncation, like those above, or a scenario involving relation
extension, where we added a new segment and that made it to disk but
the prior extension of the previous last segment file to maximum
length did not. But in that case, WAL replay should, I think, fix
things up. However, I'm not completely sure that there isn't some hole
in this theory, and this way forward also doesn't sound particularly
cheap. Nonetheless I don't have another idea right now.
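
For illustration, here is a minimal sketch of what such a startup pass
could look like, run once per relation fork before replay begins. All
of the names, the signature, and the use of plain stat()/unlink() are
invented here, and error handling is omitted:

#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical sketch: walk one relation's segment files in order and
 * unlink everything that follows the first partial segment, restoring
 * the md.c invariant before WAL replay begins.  seg_bytes would be
 * RELSEG_SIZE * BLCKSZ, i.e. 1GB by default.
 */
static void
remove_segments_after_partial(const char *relpath, off_t seg_bytes)
{
    struct stat st;
    bool        saw_partial;

    /* segment 0 is the bare relfilenode path */
    if (stat(relpath, &st) != 0)
        return;                 /* relation is gone entirely */
    saw_partial = (st.st_size < seg_bytes);

    for (int segno = 1;; segno++)
    {
        char        segpath[1024];

        snprintf(segpath, sizeof(segpath), "%s.%d", relpath, segno);
        if (stat(segpath, &st) != 0)
            break;              /* no higher-numbered segment: done */

        if (saw_partial)
            unlink(segpath);    /* orphan: follows a partial segment */
        else if (st.st_size < seg_bytes)
            saw_partial = true; /* first partial segment: keep it */
    }
}

That's still one stat() per existing segment for every relation file in
the cluster, which is the "not particularly cheap" part.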

Thoughts?

--
Robert Haas
EDB: http://www.enterprisedb.com
