Re: base backup vs. concurrent truncation

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Aleksander Alekseev <aleksander(at)timescale(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: base backup vs. concurrent truncation
Date: 2023-05-08 20:28:03
Message-ID: 20230508202803.eipcpgfeeeekmkej@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2023-05-08 08:57:08 -0400, Robert Haas wrote:
> On Mon, May 1, 2023 at 12:54 PM Aleksander Alekseev
> <aleksander(at)timescale(dot)com> wrote:
> > So I'm still unable to reproduce the described scenario, at least on PG16.
>
> Well, that proves that either (1) the scenario that I described is
> impossible for some unknown reason or (2) something is wrong with your
> test scenario. I bet it's (2), but if it's (1), it would be nice to
> know what the reason is. One can't feel good about code that appears
> on the surface to be broken even if one knows that some unknown
> magical thing is preventing disaster.

It seems pretty easy to create disconnected segments. You don't even need a
basebackup for it.

To make it easier, I rebuilt with segsize_blocks=16. This isn't required, it
just makes it a lot cheaper to play around. To no one's surprise: I'm not a
patient person...
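
For a sense of scale, here is a rough sketch of what segsize_blocks=16 means
in bytes. It assumes the default 8 kB block size and the default 1 GB segment
size; neither value is stated above, so treat both as assumptions.

```python
# Assumption: default PostgreSQL block size of 8 kB.
BLCKSZ = 8192

# Segment size used for the experiment (from segsize_blocks=16).
SEGSIZE_BLOCKS = 16
segment_bytes = BLCKSZ * SEGSIZE_BLOCKS

# Assumption: the default segment size is 1 GB.
default_segment_bytes = 1024 ** 3
default_segsize_blocks = default_segment_bytes // BLCKSZ

# How much less data is needed to fill one segment.
shrink_factor = default_segsize_blocks // SEGSIZE_BLOCKS

print(segment_bytes, default_segsize_blocks, shrink_factor)
```

With these assumptions a segment is 128 kB instead of 1 GB, so even a small
table spans many segment files.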

Started server with autovacuum=off.

DROP TABLE IF EXISTS large;
CREATE TABLE large AS SELECT generate_series(1, 100000);
SELECT current_setting('data_directory') || '/' || pg_relation_filepath('large');

ls -l /srv/dev/pgdev-dev/base/5/24585*
shows lots of segments.

attach gdb, set breakpoint on truncate.

DROP TABLE large;

breakpoint will fire. Continue once.

In concurrent session, trigger checkpoint. Due to the checkpoint we'll not
replay any WAL record. And the checkpoint will unlink the first segment.

Kill the server.

After crash recovery, you end up with all but the first segment still
present. As the first segment doesn't exist anymore, nothing prevents that oid
from being recycled in the future. Once it is recycled and the first segment
grows large enough, the later segments will suddenly re-appear.
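
A toy model of why the leftover segments "re-appear" might look like the
following. It is only a sketch: nblocks() is a hypothetical helper that
imitates the idea that the relation size is found by walking segment files in
order and stopping at the first one that is missing or not full. It is not
the actual md.c code.

```python
RELSEG_SIZE = 16  # blocks per segment, matching segsize_blocks=16


def nblocks(segments):
    """Return the apparent relation size in blocks.

    `segments` maps segment number -> number of blocks in that file.
    Segments are walked starting at 0; the walk stops at the first
    segment that is missing or not completely full.
    """
    total = 0
    segno = 0
    while segno in segments:
        total += segments[segno]
        if segments[segno] < RELSEG_SIZE:
            break
        segno += 1
    return total


# State after crash recovery: segment 0 is gone, later ones are left behind.
on_disk = {1: 16, 2: 16, 3: 4}
size_orphaned = nblocks(on_disk)

# The relfilenode is recycled; the new relation starts small.
on_disk[0] = 5
size_small = nblocks(on_disk)

# The new relation grows until its first segment is full.
on_disk[0] = RELSEG_SIZE
size_full = nblocks(on_disk)

print(size_orphaned, size_small, size_full)
```

In this model the stale segments are invisible while segment 0 is missing or
short, and become part of the relation as soon as segment 0 is full.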

It's not quite so trivial to reproduce issues with partial truncations /
concurrent base backups. The problem is that it's hard to guarantee the
iteration order of the base backup process. You'd just need to write a manual
base backup script though.

Consider a script mimicking the filesystem returning directory entries in
"reverse name order". Recipe includes two sessions. One (BB) doing a base
backup, the other (DT) running VACUUM making the table shorter.

BB: Copy <relfilenode>.2
BB: Copy <relfilenode>.1
DT: Truncate relation to < SEGSIZE
BB: Copy <relfilenode>

The replay of the smgrtruncate record will determine the relation size to
figure out what segments to remove. Because <relfilenode> is < SEGSIZE it'll
only truncate <relfilenode>, not <relfilenode>.N. And boom, a disconnected
segment.
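
The interleaving above can be simulated with a small model. This is a sketch
under simplifying assumptions: reachable() and truncate() are hypothetical
helpers, segments are represented only by their block counts, and a truncated
segment past the new end is modelled as an empty file. The real smgr/md code
is more involved.

```python
RELSEG_SIZE = 16  # blocks per segment


def reachable(segments):
    """Segments found by walking from 0 until one is missing or not full."""
    found = []
    segno = 0
    while segno in segments:
        found.append(segno)
        if segments[segno] < RELSEG_SIZE:
            break
        segno += 1
    return found


def truncate(segments, new_nblocks):
    """Shrink the relation to new_nblocks, touching only reachable segments."""
    for segno in reachable(segments):
        start = segno * RELSEG_SIZE
        keep = min(max(new_nblocks - start, 0), RELSEG_SIZE)
        segments[segno] = min(segments[segno], keep)


live = {0: 16, 1: 16, 2: 10}   # primary's files before the truncation
backup = {}

backup[2] = live[2]            # BB: copy <relfilenode>.2
backup[1] = live[1]            # BB: copy <relfilenode>.1
truncate(live, 5)              # DT: truncate relation to < SEGSIZE
backup[0] = live[0]            # BB: copy <relfilenode>

truncate(backup, 5)            # replay of the truncate record on the backup

# Segments that still hold data but cannot be reached from segment 0.
stale = sorted(s for s, n in backup.items()
               if n > 0 and s not in reachable(backup))
print(live, backup, stale)
```

On the primary every later segment is emptied, while in the restored backup
segments 1 and 2 keep their old contents because the replay never looks past
the short first segment.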

(I'll post a separate email about an evolved proposal about fixing this set of
issues)

Greetings,

Andres Freund
