Re: trying again to get incremental backup

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: trying again to get incremental backup
Date: 2023-11-29 14:06:19
Message-ID: CA+TgmoYuC27_ToGtTTNyHgpn_eJmdqrmhJ93bAbinkBtXsWHaA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 15, 2023 at 9:14 AM Jakub Wartak
<jakub(dot)wartak(at)enterprisedb(dot)com> wrote:
> so I've spent some time playing still with patchset v8 (without the
> 6/6 testing patch related to wal_level=minimal), with the exception of
> - patchset v9 - marked otherwise.

Thanks, as usual, for that.

> 2. Usability thing: I hit the timeout hard: "This backup requires WAL
> to be summarized up to 0/90000D8, but summarizer has only reached
> 0/0." with summarize_wal=off (default), but apparently this is in TODO.
> Looks like an important usability thing.

All right. I'd sort of forgotten about the need to address that issue,
but apparently, I need to re-remember.

> 5. On v8 i've finally played a little bit with standby(s) and this
> patchset with couple of basic scenarios while mixing source of the
> backups:
>
> a. full on standby, incr1 on standby, full db restore (incl. incr1) on standby
> # sometimes i'm getting spurious error like those when doing
> incrementals on standby with -c fast :
> 2023-11-15 13:49:05.721 CET [10573] LOG: recovery restart point
> at 0/A000028
> 2023-11-15 13:49:07.591 CET [10597] WARNING: aborting backup due
> to backend exiting before pg_backup_stop was called
> 2023-11-15 13:49:07.591 CET [10597] ERROR: manifest requires WAL
> from final timeline 1 ending at 0/A0000F8, but this backup starts at
> 0/A000028
> 2023-11-15 13:49:07.591 CET [10597] STATEMENT: BASE_BACKUP (
> INCREMENTAL, LABEL 'pg_basebackup base backup', PROGRESS,
> CHECKPOINT 'fast', WAIT 0, MANIFEST 'yes', TARGET 'client')
> # when you retry the same pg_basebackup it goes fine (looks like
> CHECKPOINT on standby/restartpoint <-> summarizer disconnect, I'll dig
> deeper tomorrow. It seems that issuing "CHECKPOINT; pg_sleep(1);"
> against primary just before pg_basebackup --incr on standby
> works around it)
>
> b. full on primary, incr1 on standby, full db restore (incl. incr1) on
> standby # WORKS
> c. full on standby, incr1 on standby, full db restore (incl. incr1) on
> primary # WORKS*
> d. full on primary, incr1 on standby, full db restore (incl. incr1) on
> primary # WORKS*
>
> * - needs pg_promote() due to the controlfile having standby bit +
> potential fiddling with postgresql.auto.conf as it is having
> primary_connstring GUC.

Well, "manifest requires WAL from final timeline 1 ending at
0/A0000F8, but this backup starts at 0/A000028" is a valid complaint,
not a spurious error. It's essentially saying that WAL replay for this
incremental backup would have to begin at a location that is earlier
than where replay for the earlier backup would have to end while
recovering that backup. It's almost like you're trying to go backwards
in time, with the incremental happening before the full backup instead
of after it. I think the reason this is happening is that when you
take a backup, recovery has to start from the previous checkpoint. On
the primary, we perform a new checkpoint and plan to start recovery
from it. But on a standby, we can't perform a new checkpoint, since we
can't write WAL, so we arrange for recovery of the backup to begin
from the most recent checkpoint. And if you do two backups on the
standby in a row without much happening in the middle, then the most
recent checkpoint will be the same for both. And that, I think, is
what's resulting in this error: the end of any backup follows its own
start, so if two consecutive backups have the same start, then the
start of the second one will precede the end of the first one.
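
To make the ordering argument concrete, here is a minimal sketch in
Python. It is not the server's actual code, and the function names are
hypothetical; it only assumes the usual textual LSN form (two hex
numbers separated by a slash, high 32 bits first) and ignores
timelines entirely. The LSNs are the ones from the log above.

```python
def parse_lsn(text: str) -> int:
    """Convert an LSN such as '0/A0000F8' into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def incremental_start_is_valid(prior_end_lsn: str, incr_start_lsn: str) -> bool:
    """Hypothetical helper: the incremental backup must not start
    before the point where the prior backup ends."""
    return parse_lsn(incr_start_lsn) >= parse_lsn(prior_end_lsn)


# Values taken from the error message quoted above.
PRIOR_END = "0/A0000F8"    # where the earlier backup ends
INCR_START = "0/A000028"   # where the new incremental would start

if not incremental_start_is_valid(PRIOR_END, INCR_START):
    print(f"prior backup ends at {PRIOR_END}, "
          f"but this backup starts at {INCR_START}")
```

With two back-to-back backups on a standby, both start at the same
restart point, so the second start necessarily lands before the first
end and the check fails.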

One thing that's interesting to note here is that there is no point in
performing an incremental backup under these circumstances. You would
accrue no advantage over just letting replay continue further from the
full backup. The whole point of an incremental backup is that it lets
you "fast forward" your older backup -- you could have just replayed
all the WAL from the older backup until you got to the start LSN of
the newer backup, but reconstructing a backup that can start replay
from the newer LSN directly is, hopefully, quicker than replaying all
of that WAL. But in this scenario, you're starting from the same
checkpoint no matter what -- the amount of WAL replay required to
reach any given LSN will be unchanged. So storing an incremental
backup would be strictly a loss.
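
A rough way to see this, as a sketch only: treat the WAL that must be
replayed as the distance between the backup's start LSN and the LSN
you want to reach. The function names and the target LSN below are
made up for illustration; only 0/A000028 comes from the report above.

```python
def parse_lsn(text: str) -> int:
    """Convert an LSN such as '0/A000028' into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def wal_bytes_to_replay(backup_start_lsn: str, target_lsn: str) -> int:
    """Hypothetical measure: bytes of WAL between the point where
    replay begins and the LSN we want to reach."""
    return parse_lsn(target_lsn) - parse_lsn(backup_start_lsn)


TARGET = "0/C000000"            # assumed recovery target, illustration only
FULL_START = "0/A000028"        # start of the full backup (from the log)
INCR_SAME_START = "0/A000028"   # incremental taken from the same checkpoint
INCR_LATER_START = "0/B000028"  # assumed later checkpoint, for contrast

from_full = wal_bytes_to_replay(FULL_START, TARGET)
from_same = wal_bytes_to_replay(INCR_SAME_START, TARGET)
from_later = wal_bytes_to_replay(INCR_LATER_START, TARGET)

# Same checkpoint: the incremental saves no replay work at all.
print(from_full - from_same)
# Later checkpoint: the incremental lets replay skip this many bytes.
print(from_full - from_later)
```

Only when the incremental starts from a later checkpoint does the
distance to the target shrink, which is the "fast forward" benefit.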

Another interesting point to consider is that you could also get this
complaint by taking the full backup from the primary and then trying
to take an incremental backup from a standby, maybe even a
time-delayed standby that's far behind the primary. In that case, you
would really be trying to take an incremental backup before you
actually took the full backup, as far as LSN time goes.

I'm not quite sure what to do about any of this. I think the error is
correct and accurate, but understanding what it means and why it's
happening and what to do about it is probably going to be difficult
for people. Possibly we should have documentation that talks you
through all of this. Or possibly there are ways to elaborate on the
error message itself. But I'm a little skeptical about the latter
approach because it's all so complicated. I don't know that we can
summarize it in a sentence or two.

> 6. Sci-fi-mode-on: I was wondering about the dangers of e.g. having
> more recent pg_basebackup (e.g. from pg18 one day) running against
> pg17 in the scope of having this incremental backups possibility. Is
> it going to be safe? (currently there seems to be no safeguards
> against such use) or should those things (core, pg_basebackup) be
> running in version lock step?

I think it should be safe, actually. pg_basebackup has no reason to
care about WAL format changes across versions. It doesn't even care
about the format of the WAL summaries, which it never sees, but only
needs the server to have. If we change the format of the incremental
files that are included in the backup, then we will need
backward-compatibility code, or we can disallow cross-version
operations. I don't currently foresee a need to do that, but you never
know. It's manageable in any event.

But note that I also didn't (and can't, without a lot of ugliness)
make pg_combinebackup version-independent. So you could think of
taking incremental backups with a different version of pg_basebackup,
but if you want to restore you're going to need a matching version of
pg_combinebackup.

--
Robert Haas
EDB: http://www.enterprisedb.com
