Issue with recovery_target = 'immediate'

From: David Steele <david(at)pgmasters(dot)net>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Issue with recovery_target = 'immediate'
Date: 2022-07-13 18:41:40
Message-ID: 5a6ce6bf-2a27-2b90-5cdb-14f6c89ca238@pgmasters.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hackers,

We have noticed an issue when performing recovery with recovery_target =
'immediate' when the latest timeline cannot be replayed to from the
backup (current) timeline.

First, create two backups:

$ pgbackrest --stanza=demo --type=full --start-fast backup
$ pgbackrest --stanza=demo --type=full --start-fast backup
$ pgbackrest info

<snip>
full backup: 20220713-175710F
timestamp start/stop: 2022-07-13 17:57:10 / 2022-07-13 17:57:14
wal start/stop: 000000010000000000000003 /
000000010000000000000003
database size: 23.2MB, database backup size: 23.2MB
repo1: backup set size: 2.8MB, backup size: 2.8MB

full backup: 20220713-175748F
timestamp start/stop: 2022-07-13 17:57:48 / 2022-07-13 17:57:52
wal start/stop: 000000010000000000000005 /
000000010000000000000005
database size: 23.2MB, database backup size: 23.2MB
repo1: backup set size: 2.8MB, backup size: 2.8MB

Restore the first backup:

$ pg_ctlcluster 13 demo stop
$ pgbackrest --stanza=demo --delta --set=20220713-175710F
--type=immediate --target-action=promote restore

Recovery settings:

$ cat /var/lib/postgresql/13/demo/postgresql.auto.conf

<snip>
restore_command = 'pgbackrest --stanza=demo archive-get %f "%p"'
recovery_target = 'immediate'
recovery_target_action = 'promote'

Starting PostgreSQL performs recovery as expected:

$ pg_ctlcluster 13 demo start
$ cat /var/log/postgresql/postgresql-13-demo.log

LOG: database system was interrupted; last known up at 2022-07-13
17:57:10 UTC
LOG: starting point-in-time recovery to earliest consistent point
LOG: restored log file "000000010000000000000003" from archive
LOG: redo starts at 0/3000028
LOG: consistent recovery state reached at 0/3000138
LOG: recovery stopping after reaching consistency
LOG: redo done at 0/3000138
LOG: database system is ready to accept read only connections
LOG: selected new timeline ID: 2
LOG: archive recovery complete
LOG: database system is ready to accept connections

Now restore the second backup (recovery settings are identical):

$ pg_ctlcluster 13 demo stop
$ pgbackrest --stanza=demo --delta --set=20220713-175748F
--type=immediate --target-action=promote restore

Recovery now fails:

$ pg_ctlcluster 13 demo start
$ cat /var/log/postgresql/postgresql-13-demo.log

LOG: database system was interrupted; last known up at 2022-07-13
17:57:48 UTC
LOG: restored log file "00000002.history" from archive
LOG: starting point-in-time recovery to earliest consistent point
LOG: restored log file "00000002.history" from archive
LOG: restored log file "000000010000000000000005" from archive
FATAL: requested timeline 2 is not a child of this server's history
DETAIL: Latest checkpoint is at 0/5000060 on timeline 1, but in the
history of the requested timeline, the server forked off from that
timeline at 0/3000138.
LOG: startup process (PID 511) exited with exit code 1
LOG: aborting startup due to startup process failure
LOG: database system is shut down

While it is certainly true that timeline 2 cannot be replayed to from
timeline 1, it should not matter for an immediate recovery that stops at
consistency. No timeline switch will occur until promotion. Of course
the cluster could be shut down before promotion and the target changed,
but in that case couldn't timeline be adjusted at that point?

This works by default for PostgreSQL < 12 because the default timeline
is current. Since PostgreSQL 12 the default has been latest, which does
not work by default.

When a user does a number of recoveries it is pretty common for the
timelines to get in a state that will make most subsequent recoveries
fail. We think it makes sense for recovery_target = 'immediate' to be a
fail safe that always works no matter the state of the latest timeline.

Our solution has been to force recovery_target_timeline = 'current' when
recovery_target = 'immediate', but it seems like this is something that
should be done in PostgreSQL instead.

Thoughts?

Regards,
-David

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nathan Bossart 2022-07-13 18:53:50 Re: allow building trusted languages without the untrusted versions
Previous Message Nathan Bossart 2022-07-13 18:38:03 Re: Add --{no-,}bypassrls flags to createuser