Re: pg_rewind WAL segments deletion pitfall

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: cyberdemn(at)gmail(dot)com
Cc: bungina(at)gmail(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: pg_rewind WAL segments deletion pitfall
Date: 2022-08-30 07:39:45
Message-ID: 20220830.163945.294488629720711896.horikyota.ntt@gmail.com
Lists: pgsql-bugs pgsql-hackers

At Tue, 30 Aug 2022 14:50:26 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> AFAICS pg_rewind doesn't. On the contrary, the -c option restores all
> the segments after the last (common) checkpoint, and all of them are
> left alone after pg_rewind finishes. postgres itself removes the WAL
> files after recovery. After-promotion cleanup and a checkpoint remove
> the files on the previous timeline.
>
> Before pg_rewind runs in the repro below, the old primary has the
> following segments.
>
> TLI1: 2 8 9 A B C D
>
> Just after pg_rewind finishes, the old primary has the following
> segments.
>
> TLI1: 2 3 5 6 7
> TLI2: 4 (and 00000002.history)
>
> pg_rewind copied 1-2 to 1-3 and 2-4 and history file from the new
> primary, 1-4 to 1-7 from archive. After rewind finished, 1-4, 1-8 to
> 1-D have been removed since the new primary didn't have them.
>
> Recovery starts from 1-3 and promotes at 0/4_000000. postgres removes
> 1-5 to 1-7 by post-promotion cleanup and removes 1-2 to 1-4 by a
> restartpoint. All of the segments are useless after the old primary
> promotes.
>
> When the old primary starts, it uses 1-3 and 2-4 for recovery and
> fails to fetch 2-5 from the new primary. But it is not an issue of
> pg_rewind at all.

Ah. I think I understand what you mean. If the new primary doesn't
have segments 1-3 to 1-6, pg_rewind removes them. The new primary has
them neither in pg_wal nor in its archive. The old primary has them in
its archive. So to get out of this situation, we need to do the
following *two* things before the old primary can start:

1. copy 1-3 to 1-6 from the archive of the *old* primary
2. copy 2-7 and later from the archive of the *new* primary
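
For reference, the shorthand "1-3" above stands for timeline 1,
segment 3, i.e. the file 000000010000000000000003 that appears in the
repro script. A small sketch of that mapping follows; walname is a
hypothetical helper, and it assumes the middle (log number) part of
the name stays 0, as it does in this repro.

```shell
# Hypothetical helper: expand the "TLI-SEG" shorthand into a WAL file name.
# Assumes the log-number part of the name is 0 (true for this small repro).
walname() {
    # $1 = timeline ID (decimal), $2 = segment number as the hex digit(s)
    # written in the shorthand, e.g. 3 or D
    printf '%08X%08X%08X\n' "$1" 0 "0x$2"
}

walname 1 3
walname 2 7
```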

Since pg_rewind has already copied them into the old primary's pg_wal,
removing them just forces users to do the same task twice, as you
stated.

Okay, I completely understand the problem and am convinced that it is
worth changing the behavior.

However, the proposed patch looks too complex to me. It can be done
by just comparing xlog file name and the last checkpoint location and
TLI in decide_file_actions().
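
One way to picture that comparison, as a rough shell sketch rather
than the actual C change: keep_wal_segment and wal_field are
hypothetical helpers, and the rule encoded here (same timeline as the
last common checkpoint, and not older than the segment holding that
checkpoint) is only an assumption about what the simpler check might
look like. The checkpoint's segment name is taken as given.

```shell
# Hypothetical helper: print characters $2 (a cut-style range) of string $1.
wal_field() { printf %s "$1" | cut -c"$2"; }

# Hypothetical helper, not part of pg_rewind: succeed (exit 0) if a file in
# the target's pg_wal is a WAL segment on the checkpoint's timeline and is
# not older than the segment containing the last common checkpoint.
keep_wal_segment() {
    fname=$1        # candidate file name in pg_wal
    ckpt_fname=$2   # segment holding the last common checkpoint (assumed known)

    # WAL segment names are exactly 24 uppercase hex digits:
    # timeline (8) + log number (8) + segment number (8).
    for name in "$fname" "$ckpt_fname"; do
        [ "${#name}" -eq 24 ] || return 1
        case $name in *[!0-9A-F]*) return 1 ;; esac
    done

    # Different timeline: not covered by this rule.
    [ "$(wal_field "$fname" 1-8)" = "$(wal_field "$ckpt_fname" 1-8)" ] || return 1

    log=$(( 0x$(wal_field "$fname" 9-16) ))
    seg=$(( 0x$(wal_field "$fname" 17-24) ))
    ckpt_log=$(( 0x$(wal_field "$ckpt_fname" 9-16) ))
    ckpt_seg=$(( 0x$(wal_field "$ckpt_fname" 17-24) ))

    # Keep when (log, seg) >= (ckpt_log, ckpt_seg).
    [ "$log" -gt "$ckpt_log" ] ||
        { [ "$log" -eq "$ckpt_log" ] && [ "$seg" -ge "$ckpt_seg" ]; }
}
```

With the last common checkpoint in 1-3, as in the repro below,
keep_wal_segment 000000010000000000000005 000000010000000000000003
succeeds, while 000000010000000000000002 and 00000002.history are
rejected. The sketch does not try to bound the range at the divergence
point.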

regards.

=====
# killall -9 postgres
# rm -r oldprim newprim oldarch newarch oldprim.log newprim.log
mkdir newarch oldarch
initdb -k -D oldprim
echo "archive_mode = 'on'">> oldprim/postgresql.conf
echo "archive_command = 'echo "archive %f" >&2; cp %p `pwd`/oldarch/%f'">> oldprim/postgresql.conf
pg_ctl -D oldprim -o '-p 5432' -l oldprim.log start
psql -p 5432 -c 'create table t(a int)'
pg_basebackup -D newprim -p 5432
echo "primary_conninfo='host=/tmp port=5432'">> newprim/postgresql.conf
echo "archive_command = 'echo "archive %f" >&2; cp %p `pwd`/newarch/%f'">> newprim/postgresql.conf
touch newprim/standby.signal
pg_ctl -D newprim -o '-p 5433' -l newprim.log start

# the last common checkpoint
psql -p 5432 -c 'checkpoint'

# record approx. diverging WAL segment
start_wal=`psql -p 5433 -Atc "select pg_walfile_name(pg_last_wal_replay_lsn() - (select setting from pg_settings where name = 'wal_segment_size')::int);"`
for i in $(seq 1 5); do psql -p 5432 -c 'insert into t values(0); select pg_switch_wal();'; done
psql -p 5432 -c 'checkpoint'
pg_ctl -D newprim promote

# old primary loses diverging WAL segment
for i in $(seq 1 4); do psql -p 5432 -c 'insert into t values(0); select pg_switch_wal();'; done
psql -p 5432 -c 'checkpoint;'
psql -p 5433 -c 'checkpoint;'

# old primary cannot archive any more
echo "archive_command = 'false'">> oldprim/postgresql.conf
pg_ctl -D oldprim reload
pg_ctl -D oldprim stop

# rewind the old primary, using its own archive
# pg_rewind -D oldprim --source-server='port=5433' # should fail
echo "restore_command = 'echo "restore %f" >&2; cp `pwd`/oldarch/%f %p'">> oldprim/postgresql.conf
pg_rewind -D oldprim --source-server='port=5433' -c

# advance WAL on the new primary; new primary loses the launching WAL seg
for i in $(seq 1 4); do psql -p 5433 -c 'insert into t values(0); select pg_switch_wal();'; done
psql -p 5433 -c 'checkpoint'
echo "primary_conninfo='host=/tmp port=5433'">> oldprim/postgresql.conf
touch oldprim/standby.signal

#### copy the missing file of the old timeline
## cp oldarch/00000001000000000000000[3456] oldprim/pg_wal
## cp newarch/00000002000000000000000* oldprim/pg_wal

postgres -D oldprim # fails with "WAL file has been removed"

# The alternative of copying-in
# echo "restore_command = 'echo "restore %f" >&2; cp `pwd`/newarch/%f %p'">> oldprim/postgresql.conf

# copy-in WAL files from new primary's archive to old primary
(cd newarch;
for f in `ls`; do
if [[ "$f" > "$start_wal" ]]; then echo copy $f; cp $f ../oldprim/pg_wal; fi
done)

postgres -D oldprim
=====

--
Kyotaro Horiguchi
NTT Open Source Software Center
