Re: trying again to get incremental backup

From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: trying again to get incremental backup
Date: 2023-12-15 10:36:26
Message-ID: CAKZiRmwd3KTNMQhm9Bv4oR_1uMehXroO6kGyJQkiw9DfM8cMwQ@mail.gmail.com
Thread:
Lists: pgsql-hackers

Hi Robert,

On Wed, Dec 13, 2023 at 2:16 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
>
> > > not even in case of an intervening
> > > timeline switch. So, all of the errors in this function are warning
> > > you that you've done something that you really should not have done.
> > > In this particular case, you've either (1) manually removed the
> > > timeline history file, and not just any timeline history file but the
> > > one for a timeline for a backup that you still intend to use as the
> > > basis for taking an incremental backup or (2) tried to use a full
> > > backup taken from one server as the basis for an incremental backup on
> > > a completely different server that happens to share the same system
> > > identifier, e.g. because you promoted two standbys derived from the
> > > same original primary and then tried to use a full backup taken on one
> > > as the basis for an incremental backup taken on the other.
> > >
> >
> > Okay, but please consider two other possibilities:
> >
> > (3) I had a corrupted DB where I've fixed it by running pg_resetwal
> > and some cronjob just a day later attempted to take incremental and
> > failed with that error.
> >
> > (4) I had pg_upgraded (which calls pg_resetwal on fresh initdb
> > directory) the DB where I had cronjob that just failed with this error
> >
> > I bet that (4) is going to happen more often than (1), (2) , which
> > might trigger users to complain on forums, support tickets.
>
> Hmm. In case (4), I was thinking that you'd get a complaint about the
> database system identifier not matching. I'm not actually sure that's
> what would happen, though, now that you mention it.
>

I've played with initdb/pg_upgrade (17->17) and I don't get a DBID
mismatch (of course the identifiers do differ after initdb), but I get
this instead:

$ pg_basebackup -c fast -D /tmp/incr2.after.upgrade -p 5432 --incremental /tmp/incr1.before.upgrade/backup_manifest
WARNING: aborting backup due to backend exiting before pg_backup_stop was called
pg_basebackup: error: could not initiate base backup: ERROR: timeline 2 found in manifest, but not in this server's history
pg_basebackup: removing data directory "/tmp/incr2.after.upgrade"

Also, I don't see the DBID in the manifest?
Maybe it's a nuisance, but what I'm trying to say is that if an
automated cron job running pg_basebackup --incremental hits a freshly
upgraded cluster, that error message without an errhint() is going to
scare some junior DBAs.
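For what it's worth, I'd expect operators to end up writing wrapper logic like the sketch below to cope with it: try the incremental, and start a fresh chain with a full backup if the server rejects the manifest. This is just my illustration (paths and the simplified pg_basebackup invocation are made up), not a recommendation for core:

```python
import subprocess

def take_backup(dest, manifest, run=subprocess.run):
    """Try an incremental backup; fall back to a full backup if the
    server rejects the manifest (e.g. after pg_resetwal/pg_upgrade)."""
    incr = run(["pg_basebackup", "-c", "fast", "-D", dest,
                "--incremental", manifest])
    if incr.returncode == 0:
        return "incremental"
    # The manifest no longer matches this server's timeline history:
    # start a new backup chain instead of failing the whole cron job.
    full = run(["pg_basebackup", "-c", "fast", "-D", dest])
    return "full" if full.returncode == 0 else "failed"
```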

> > LGTM, all quick tests work from my end too. BTW: I have also scheduled
> > the long/large pgbench -s 14000 (~200GB?) - multiple day incremental
> > test. I'll let you know how it went.
>
> Awesome, thank you so much.

OK, so pgbench -i -s 14440 and pgbench -P 1 -R 100 -c 8 -T 259200 did
generate pretty large incrementals, so I had to abort the run due to
lack of space; I was expecting the incrementals to be smaller, but
they took too much room. I initially suspected the problem was the
uniform distribution of `\set aid random(1, 100000 * :scale)` in the
tpcb script, which spreads the UPDATEs across the whole of the big
pgbench_accounts.

$ du -sm /backups/backups/* /backups/archive/
216205 /backups/backups/full
215207 /backups/backups/incr.1
216706 /backups/backups/incr.2
102273 /backups/archive/

So I verified the recoverability yesterday anyway: the
pg_combinebackup "full incr.1 incr.2" took 44 minutes, and the
subsequent archive WAL recovery and promotion SUCCEEDED. The 8-way
parallel seqscan for sum(abalance) on pgbench_accounts and the other
tables worked fine. pg_combinebackup was using 15-20% CPU (mostly
%sys) while doing roughly 60-80MB/s each of reads and writes (it's
slow, but that's due to the maxed-out sequential I/O of the Premium
SSD on a small Azure instance).

So I've launched another, improved test (forcing more localized
UPDATEs) to see the more realistic space-effectiveness of the
incremental backup:

\set aid random_exponential(1, 100000 * :scale, 8)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;
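Just to sanity-check how localized those UPDATEs should be: a quick back-of-the-envelope sketch (the CDF below is my own derivation from the truncated-exponential density the pgbench docs describe for random_exponential) of how the UPDATE traffic should spread across the ~175 1GB segments of pgbench_accounts at this scale:

```python
import math

def mass_below(frac, param=8.0):
    """Approximate CDF of pgbench's random_exponential at `frac` of the
    key range: density proportional to exp(-param * t) for t in [0, 1)."""
    return (1 - math.exp(-param * frac)) / (1 - math.exp(-param))

SEGMENTS = 175  # ~175 x 1GB segments for pgbench_accounts here

print(f"segment 0 gets ~{mass_below(1 / SEGMENTS):.1%} of updates")
print(f"first 15 segments get ~{mass_below(15 / SEGMENTS):.1%} of updates")
```

With parameter 8 that puts roughly half of the updates into the first ~15 segments, so the high-numbered segments really should come out cheap in the incrementals.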

But then... (and I have verified that the :aid values above do skew
low) ... the same thing happened:

backups/backups$ du -sm /backups/backups/*
210229 /backups/backups/full
208299 /backups/backups/incr.1
208351 /backups/backups/incr.2

# pgbench_accounts has relfilenodeid 16486
postgres(at)jw-test-1:/backups/backups$ for L in 5 10 15 30 100 161 173 174 175; do
  md5sum full/base/5/16486.$L ./incr.1/base/5/16486.$L \
    ./incr.2/base/5/16486.$L /var/lib/postgres/17/data/base/5/16486.$L
  echo
done
005c6bbb40fca3c1a0a819376ef0e793 full/base/5/16486.5
005c6bbb40fca3c1a0a819376ef0e793 ./incr.1/base/5/16486.5
005c6bbb40fca3c1a0a819376ef0e793 ./incr.2/base/5/16486.5
005c6bbb40fca3c1a0a819376ef0e793 /var/lib/postgres/17/data/base/5/16486.5

[.. all the checksums match (!) for the above $L..]

c5117a213253035da5e5ee8a80c3ee3d full/base/5/16486.173
c5117a213253035da5e5ee8a80c3ee3d ./incr.1/base/5/16486.173
c5117a213253035da5e5ee8a80c3ee3d ./incr.2/base/5/16486.173
c5117a213253035da5e5ee8a80c3ee3d /var/lib/postgres/17/data/base/5/16486.173

47ee6b18d7f8e40352598d194b9a3c8a full/base/5/16486.174
47ee6b18d7f8e40352598d194b9a3c8a ./incr.1/base/5/16486.174
47ee6b18d7f8e40352598d194b9a3c8a ./incr.2/base/5/16486.174
47ee6b18d7f8e40352598d194b9a3c8a /var/lib/postgres/17/data/base/5/16486.174

82dfeba58b4a1031ac12c23f9559a330 full/base/5/16486.175
21a8ac1e6fef3cf0b34546c41d59b2cc ./incr.1/base/5/16486.175
2c3d89c612b2f97d575a55c6c0204d0b ./incr.2/base/5/16486.175
73367d44d76e98276d3a6bbc14bb31f1 /var/lib/postgres/17/data/base/5/16486.175

So to me it looks like it copied 174 out of 175 segment files anyway,
lowering the effectiveness of that incremental backup to roughly 0%.
The commands used to generate those incrementals were:
pg_basebackup -v -P -c fast -D /backups/backups/incr.1 --incremental=/backups/backups/full/backup_manifest
sleep 4h
pg_basebackup -v -P -c fast -D /backups/backups/incr.2 --incremental=/backups/backups/incr.1/backup_manifest

The incremental files are being generated, but only for the first (0)
segment of each relation?
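To spell out the naming convention I mean: a >1GB relation is stored as 16486, 16486.1, ..., and the full/incremental decision has to be made per segment *file*, so I'd expect INCREMENTAL.16486.174 etc. in the output directory. The parsing helper below is purely my illustration (it ignores the _vm/_fsm forks), not the server code:

```python
def parse_segment_name(name):
    """Split a relation data file name like '16486.175' into
    (relfilenode, segment number); a bare '16486' is segment 0.
    Simplified illustration: fork suffixes (_vm, _fsm) not handled."""
    relfilenode, dot, seg = name.partition(".")
    return int(relfilenode), (int(seg) if dot else 0)

# A 175-segment pgbench_accounts is 16486, 16486.1, ... 16486.174 plus
# the partial 16486.175 -- each of those files needs its own
# incremental-vs-full decision.
```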

/backups/backups$ ls -l incr.2/base/5 | grep INCR
-rw------- 1 postgres postgres 12 Dec 14 21:33 INCREMENTAL.112
-rw------- 1 postgres postgres 12 Dec 14 21:01 INCREMENTAL.113
-rw------- 1 postgres postgres 12 Dec 14 21:36 INCREMENTAL.1247
-rw------- 1 postgres postgres 12 Dec 14 21:38 INCREMENTAL.1247_vm
[..note, no INCREMENTAL.$int.$segment files]
-rw------- 1 postgres postgres 12 Dec 14 21:24 INCREMENTAL.6238
-rw------- 1 postgres postgres 12 Dec 14 21:17 INCREMENTAL.6239
-rw------- 1 postgres postgres 12 Dec 14 21:55 INCREMENTAL.827

# 16486 is pgbench_accounts
/backups/backups$ ls -l incr.2/base/5/*16486* | grep INCR
-rw------- 1 postgres postgres 14613480 Dec 14 21:00 incr.2/base/5/INCREMENTAL.16486
-rw------- 1 postgres postgres       12 Dec 14 21:52 incr.2/base/5/INCREMENTAL.16486_vm
/backups/backups$

/backups/backups$ find incr* -name INCREMENTAL.* | wc -l
1342
/backups/backups$ find incr* -name INCREMENTAL.*_* | wc -l # VM or FSM
236
/backups/backups$ find incr* -name INCREMENTAL.*.* | wc -l # not a
single >1GB single incremental relation
0
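The same classification in script form, in case it's useful (the fnmatch patterns mirror the find invocations above; the file names, including the per-segment one that should exist but doesn't, are made up for illustration):

```python
from fnmatch import fnmatch

# Names mirroring what incr.2 actually contains, plus the hypothetical
# per-segment file that *should* be there for pgbench_accounts.
names = ["INCREMENTAL.16486", "INCREMENTAL.16486_vm",
         "INCREMENTAL.1247", "INCREMENTAL.16486.174"]

forks    = [n for n in names if fnmatch(n, "INCREMENTAL.*_*")]  # vm/fsm forks
segments = [n for n in names if fnmatch(n, "INCREMENTAL.*.*")]  # >1GB segments

print(forks)     # only the _vm entry
print(segments)  # empty in the real backup; here only the made-up name
```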

I'm quickly passing this info along and haven't really looked at the
code yet, but the issue should be somewhere around
GetFileBackupMethod(), and easily reproducible with the configure
--with-segsize-blocks=X switch.

-J.
