Re: base backup vs. concurrent truncation

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: base backup vs. concurrent truncation
Date: 2023-05-08 20:41:09
Lists: pgsql-hackers


On 2023-04-25 10:28:58 -0700, Andres Freund wrote:
> On 2023-04-25 11:42:43 -0400, Robert Haas wrote:
> > On Mon, Apr 24, 2023 at 8:03 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > What we've discussed somewhere in the past is to always truncate N+1 when
> > > creating the first page in N. I.e. if we extend into 23456.1, we truncate
> > > 23456.2 to 0 blocks. As far as I can tell, that'd solve this issue?
> >
> > Yeah, although leaving 23456.2 forever unless and until that happens
> > doesn't sound amazing.
> It isn't - but the alternatives aren't great either. It's not that easy to hit
> this scenario, so I think something along these lines is more palatable than
> adding a pass through the entire data directory.

> I think eventually we'll have to make the WAL logging bulletproof enough
> against this to avoid the risk of it. I think that is possible.

I think we should extend my proposal above with improved WAL logging.

Right now the replay of XLOG_SMGR_TRUNCATE records uses the normal
smgrtruncate() path - which essentially relies on smgrnblocks() to determine
the relation size, which in turn iterates over the segment files until it
finds one that is missing or not completely full.

That's fine during normal running, where we are consistent. But it's bogus
when we're not consistent - in case of a concurrent truncation, the base
backup might have missed intermediate segments, while copying later segments.
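To illustrate the failure mode, here's a self-contained toy model (not actual
PostgreSQL source; the scan loop is simplified from md.c's mdnblocks(), and
the array-based "filesystem" is an assumption of the sketch) of how a
segment-by-segment scan computes relation size, and why an orphaned later
segment becomes invisible:

```c
#include <assert.h>

#define RELSEG_SIZE 131072      /* blocks per 1GB segment, default build */

/*
 * Simplified mdnblocks()-style scan: walk segment files in order, stopping
 * at the first one that is missing or not completely full.  seg_blocks[i]
 * is the block count of segment file i, or -1 if the file does not exist.
 */
static long
scan_nblocks(const long *seg_blocks, int nsegs)
{
	long	total = 0;

	for (int segno = 0; segno < nsegs; segno++)
	{
		long	blocks = seg_blocks[segno];

		if (blocks < 0)
			return total;		/* missing file ends the scan */
		total += blocks;
		if (blocks < RELSEG_SIZE)
			return total;		/* partial segment treated as the last */
	}
	return total;
}
```

With a consistent directory ({RELSEG_SIZE, 100}) this gives the right answer;
with segment .1 missing but .2 copied by the base backup
({RELSEG_SIZE, -1, 50}), the scan stops early and .2 is never seen, so a
size-based truncation will never remove it.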

We should fix this by including not just the "target" length in the WAL
record, but also the "current" length. Then during WAL replay of such records
we'd not look at the files currently present; we'd just stupidly truncate
all the segments in the given range.
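As a sketch of what the replay side could do once the record carries both
lengths (the helper name and shape are hypothetical, not an actual patch):
each affected segment's target length follows from the new relation size
alone, so replay can apply it from the last "current" segment downwards
without consulting smgrnblocks() at all:

```c
#include <assert.h>

#define RELSEG_SIZE 131072		/* blocks per segment */

/*
 * Target block count of segment file segno if the relation is to be
 * new_nblocks blocks long: full, partial, or 0 (truncate to empty).
 * Hypothetical helper, not PostgreSQL source.
 */
static long
segment_target_blocks(int segno, long new_nblocks)
{
	long	seg_start = (long) segno * RELSEG_SIZE;

	if (new_nblocks >= seg_start + RELSEG_SIZE)
		return RELSEG_SIZE;		/* segment stays full */
	if (new_nblocks <= seg_start)
		return 0;				/* segment truncated to zero blocks */
	return new_nblocks - seg_start; /* partial last segment */
}
```

Replay would then loop segno from current_nblocks / RELSEG_SIZE down to
new_nblocks / RELSEG_SIZE, truncating each file to its target, regardless of
which intermediate segments the base backup happened to copy.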

I think we ought to do the same for mdunlinkfork().

> I suspect we would need to prevent checkpoints from happening in the wrong
> moment, if we were to go down that route.
> I guess that eventually we'll need to polish the infrastructure for
> determining restartpoints, so that delayChkptFlags doesn't actually prevent
> checkpoints, just moves the restart to a safe LSN. Although I guess that
> truncations aren't frequent enough (compared to transaction commits) for
> that to be required "now".

Unfortunately the current approach of smgrtruncate records is quite racy with
checkpoints. You can end up with a sequence of something like

1) the XLOG_SMGR_TRUNCATE record is logged
2) a checkpoint completes, placing its redo pointer after that record
3) the segment files are truncated

If you crash anywhere during 3), we don't replay the XLOG_SMGR_TRUNCATE
record. It's not a great solution, but I suspect we'll just have to set
delayChkptFlags during truncations to prevent this.
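A toy model of that fix (heavily simplified; the real mechanism is
DELAY_CHKPT_START in MyProc->delayChkptFlags, which makes the checkpointer
wait for the flag to clear rather than return): while a backend holds the
flag across logging and truncating, a checkpoint cannot place its redo
pointer between the two:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model, not PostgreSQL source: one backend, one flag, LSNs as longs. */
static bool delay_chkpt_start = false;
static long current_lsn = 0;

static long
log_smgr_truncate(void)
{
	return ++current_lsn;		/* LSN of the XLOG_SMGR_TRUNCATE record */
}

/*
 * A checkpoint wanting to start returns the redo LSN it would use, or -1
 * if it must wait because some backend delays checkpoint start.
 */
static long
checkpoint_try_start(void)
{
	if (delay_chkpt_start)
		return -1;
	return current_lsn;
}
```

While the flag is held, checkpoint_try_start() keeps failing; once the
backend finishes truncating the files and clears it, any later checkpoint's
redo pointer necessarily covers the truncate record.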

I also think that we need to fix the order in which mdunlink() operates. It
seems *quite* bogus that we unlink in "forward" segment order, rather than in
reverse order (like mdtruncate()). If we crash after unlinking segment .N,
we won't unlink the later segments during crash recovery, nor during WAL
replay. While the improved WAL logging would address this as well, it still
seems pointlessly risky to iterate forward, rather than backward.
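A quick simulation of why the order matters (assumptions of the sketch: a
crash can land between any two unlinks, and recovery's size scan stops at
the first missing segment file, as above; toy model, not PostgreSQL source):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * present[i] says whether segment file i still exists.  Unlink segments
 * 1..nsegs-1 in the given order, "crashing" after crash_after unlinks.
 * Segment 0 is left alone, as its file is handled separately.
 */
static void
unlink_segments(int *present, int nsegs, bool forward, int crash_after)
{
	int		done = 0;

	if (forward)
	{
		for (int segno = 1; segno < nsegs && done < crash_after; segno++)
		{
			present[segno] = 0;
			done++;
		}
	}
	else
	{
		for (int segno = nsegs - 1; segno >= 1 && done < crash_after; segno--)
		{
			present[segno] = 0;
			done++;
		}
	}
}

/* number of segments a forward scan (stopping at a gap) would see */
static int
visible_segments(const int *present, int nsegs)
{
	int		n = 0;

	while (n < nsegs && present[n])
		n++;
	return n;
}
```

Forward order crashing after removing .1 leaves .2 and .3 on disk but
invisible to the scan - orphaned forever; backward order always leaves a
contiguous prefix, so nothing is hidden.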

Even with those changes, I think we might still need something like the
"always truncate the next segment" bit I described in my prior email though.


Andres Freund
