Re: pg_rewind WAL segments deletion pitfall

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: cyberdemn(at)gmail(dot)com
Cc: bungina(at)gmail(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: pg_rewind WAL segments deletion pitfall
Date: 2022-08-30 05:50:26
Message-ID: 20220830.145026.1609145145128999932.horikyota.ntt@gmail.com
Lists: pgsql-bugs pgsql-hackers

Hello, Alex.

At Fri, 26 Aug 2022 10:57:25 +0200, Alexander Kukushkin <cyberdemn(at)gmail(dot)com> wrote in
> On Fri, 26 Aug 2022 at 10:04, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
> wrote:
> > What I still don't understand is why pg_rewind doesn't work for the
> > old primary in that case. When archive_mode=on, the old primary has
> > the complete set of WAL files counting both pg_wal and its archive. So,
> > as in the previous repro, pg_rewind -c ought to work (but it
> > uses its own archive this time). In that sense the proposed solution
> > is still not needed in this case.
> >
>
> The pg_rewind finishes successfully. But as a result it removes some files
> from pg_wal that are required to perform recovery because they are missing
> on the new primary.

AFAICS pg_rewind doesn't. On the contrary, the -c option restores all
the segments after the last (common) checkpoint, and all of them are
left alone after pg_rewind finishes. It is postgres itself that removes
the WAL files after recovery: the post-promotion cleanup and a
checkpoint remove the files on the previous timeline.

Before pg_rewind runs in the repro below, the old primary has the
following segments.

TLI1: 2 8 9 A B C D

Just after pg_rewind finishes, the old primary has the following
segments.

TLI1: 2 3 5 6 7
TLI2: 4 (and 00000002.history)

pg_rewind copied 1-2, 1-3, 2-4 and the history file from the new
primary, and restored 1-4 to 1-7 from the archive (N-M here means
segment M on timeline N). By the time the rewind finished, 1-4 and 1-8
to 1-D had been removed since the new primary didn't have them.
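
A listing in the "TLI1: 2 3 5 6 7" form used above can be produced from
the file names in pg_wal. The following is only a sketch: the function
name summarize_wal is made up for illustration, and it prints the
timeline ID and the segment part in hex with leading zeros stripped, so
the output is as short as the listings above only while the WAL position
is still below 4GB with the default 16MB segment size.

```shell
# Summarize WAL segment file names read from stdin, grouped by timeline.
# A segment file name is 24 hex digits: 8 for the timeline ID followed
# by 16 for the segment position. Other files (*.history, *.partial,
# archive_status, ...) are ignored.
# NOTE: summarize_wal is a hypothetical helper, not a PostgreSQL tool.
summarize_wal() {
    grep -E '^[0-9A-F]{24}$' | LC_ALL=C sort | awk '
        {
            tli = substr($0, 1, 8)
            seg = substr($0, 9)
            sub(/^0+/, "", tli); if (tli == "") tli = "0"
            sub(/^0+/, "", seg); if (seg == "") seg = "0"
            if (tli != cur) {
                # start a new output line for each timeline
                if (NR > 1) printf "\n"
                printf "TLI%s:", tli
                cur = tli
            }
            printf " %s", seg
        }
        END { if (NR > 0) printf "\n" }'
}
```

For example, "ls oldprim/pg_wal | summarize_wal" run before and after
pg_rewind should show which segments came and went.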

Recovery starts from 1-3 and promotes at 0/4_000000. postgres removes
1-5 to 1-7 by post-promotion cleanup and removes 1-2 to 1-4 by a
restartpoint. All of the segments are useless after the old primary
promotes.
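
To relate an LSN such as the 0/4_000000 above to a segment file name, the
arithmetic can be sketched as follows. This is an illustration under
assumptions, not something taken from the server: the function name
wal_segment_name is made up, the segment size defaults to 16MB, and it
simply names the segment containing the byte at the given LSN.

```shell
# Print the name of the WAL segment file containing the byte at an LSN.
# Usage: wal_segment_name TLI LSN [SEGMENT_SIZE_BYTES]
#   TLI is decimal, LSN is in the usual hi/lo hex form (e.g. 0/4000000).
# NOTE: hypothetical helper; assumes the default 16MB segment size
# unless a third argument is given.
wal_segment_name() (
    tli=$1
    lsn=$2
    seg_size=${3:-16777216}
    hi=${lsn%%/*}                            # high 32 bits, hex
    lo=${lsn##*/}                            # low 32 bits, hex
    pos=$(( (0x$hi << 32) + 0x$lo ))         # absolute byte position
    segno=$(( pos / seg_size ))              # segment number from the start
    per_id=$(( 0x100000000 / seg_size ))     # segments per 4GB "log" id
    printf '%08X%08X%08X\n' "$tli" $(( segno / per_id )) $(( segno % per_id ))
)
```

With the defaults, "wal_segment_name 2 0/4000000" gives
000000020000000000000004, that is, 2-4 in the notation above.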

When the old primary starts, it uses 1-3 and 2-4 for recovery and
fails to fetch 2-5 from the new primary. But that is not an issue with
pg_rewind at all.

> > A bit harder situation comes after the server is successfully rewound, if
> > the new primary goes so far that the old primary cannot connect. Even
> > in that case, you can copy in the required WAL files or configure
> > restore_command of the old primary so that it finds the required WAL files
> > there.
> >
>
> Yes, we can do the backup of pg_wal before running pg_rewind, but it feels

So, if I understand you correctly, the issue you are complaining about
is not the WAL segments on the old timeline but those on the new
timeline, which have nothing to do with what pg_rewind does. As in the
case of pg_basebackup, the missing segments need to be copied somehow
from the new primary, since the old primary never had a chance to have
them.

> very ugly, because we will also have to clean this "backup" after a
> successful recovery.

What do you mean by the "backup" here? Concretely, which WAL segments do
you feel need to be removed, for example, in the repro case? Or could
you show your issue with something like the repro below?

> It would be much better if pg_rewind didn't remove WAL files between the
> last common checkpoint and diverged LSN in the first place.

So I don't follow this.

regards.

(Fixed a bug and slightly modified)
====
# killall -9 postgres
# rm -r oldprim newprim oldarch newarch oldprim.log newprim.log
mkdir newarch oldarch
initdb -k -D oldprim
echo "archive_mode = 'on'">> oldprim/postgresql.conf
echo "archive_command = 'echo "archive %f" >&2; cp %p `pwd`/oldarch/%f'">> oldprim/postgresql.conf
pg_ctl -D oldprim -o '-p 5432' -l oldprim.log start
psql -p 5432 -c 'create table t(a int)'
pg_basebackup -D newprim -p 5432
echo "primary_conninfo='host=/tmp port=5432'">> newprim/postgresql.conf
echo "archive_command = 'echo "archive %f" >&2; cp %p `pwd`/newarch/%f'">> newprim/postgresql.conf
touch newprim/standby.signal
pg_ctl -D newprim -o '-p 5433' -l newprim.log start

# the last common checkpoint
psql -p 5432 -c 'checkpoint'

# record approx. diverging WAL segment
start_wal=`psql -p 5433 -Atc "select pg_walfile_name(pg_last_wal_replay_lsn() - (select setting from pg_settings where name = 'wal_segment_size')::int);"`
psql -p 5432 -c 'insert into t values(0); select pg_switch_wal();'
pg_ctl -D newprim promote

# old primary loses diverging WAL segment
for i in $(seq 1 4); do psql -p 5432 -c 'insert into t values(0); select pg_switch_wal();'; done
psql -p 5432 -c 'checkpoint;'
psql -p 5433 -c 'checkpoint;'

# old primary cannot archive any more
echo "archive_command = 'false'">> oldprim/postgresql.conf
pg_ctl -D oldprim reload
pg_ctl -D oldprim stop

# rewind the old primary, using its own archive
# pg_rewind -D oldprim --source-server='port=5433' # should fail
echo "restore_command = 'echo "restore %f" >&2; cp `pwd`/oldarch/%f %p'">> oldprim/postgresql.conf
pg_rewind -D oldprim --source-server='port=5433' -c

# advance WAL on the new primary; new primary loses the launching WAL seg
for i in $(seq 1 4); do psql -p 5433 -c 'insert into t values(0); select pg_switch_wal();'; done
psql -p 5433 -c 'checkpoint'
echo "primary_conninfo='host=/tmp port=5433'">> oldprim/postgresql.conf
touch oldprim/standby.signal

postgres -D oldprim # fails with "WAL file has been removed"

# The alternative of copying-in
# echo "restore_command = 'echo "restore %f" >&2; cp `pwd`/newarch/%f %p'">> oldprim/postgresql.conf

# copy-in WAL files from new primary's archive to old primary
(cd newarch;
for f in `ls`; do
if [[ "$f" > "$start_wal" ]]; then echo copy $f; cp $f ../oldprim/pg_wal; fi
done)

postgres -D oldprim
====

--
Kyotaro Horiguchi
NTT Open Source Software Center
