Re: pg_rewind WAL segments deletion pitfall

From: Alexander Kukushkin <cyberdemn(at)gmail(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: bungina(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: pg_rewind WAL segments deletion pitfall
Date: 2022-08-25 08:34:40
Message-ID: CAFh8B=nqknwnLYU9fjVf6wwd2VceMHYuLk0bV3ta64yjF2kx9g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

Hello Kyotaro,

On Thu, 25 Aug 2022 at 09:49, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
wrote:

> In the first place, this is not a bug. (At least doesn't seem.)
> If you mean to propose behavioral changes, -hackers is the place.
>

Well, maybe... We can always change it.

> > 8. When pg_rewind finished successfully, we see that the WAL segments
> > with .ready files are removed, because they were already absent on the
> > promoted replica. We end up in a situation where we completely lose
> some
> > WAL segments, even though we had a clear sign that they were not
> > archived and
> > more importantly, pg_rewind read these segments while collecting
> > information about the data blocks.
>
> In terms of syncing the old primary to the new primary, no data has
> been lost. The "lost" segments are anyway unusable for the new primary
> since they no longer compatible with it. How do you intended to use
> the WAL files for the incompatible cluster?
>

These files are required for the old primary to start as a replica.

>
> > 9. The old primary fails to start because of the missing WAL segments
> > (more strictly, the records between the last common checkpoint and the
> > point of divergence) with the following log record: "ERROR:
> requested WAL
> > segment 000000020004D20200000096 has already been removed"
>
> That means that the tail end of the rewound old primary has been lost
> on the new primary's pg_wal.

Correct. The old primary was down for about 20m and we have
checkpoint_timeout = 5m, so the new primary already recycled them.

> In that case, you need to somehow
> copy-in the archived WAL files on the new primary. You can just do
> that or you can set up restore_command properly.
>

These files never made it to the archive because the server crashed. The
only place where they existed was pg_wal in the old primary.

> > Thus, my thoughts are: why can’t pg_rewind be a little bit wiser in terms
> > of creating filemap for WALs? Can it preserve the WAL segments that
> contain
> > those potentially lost records (> the last common checkpoint and < the
> > point of divergence) on the target? (see the patch attached)
>
> Since they are not really needed once rewind completes.
>

The pg_rewind creates the backup_label file with START WAL LOCATION and
CHECKPOINT LOCATION that point to the last common checkpoint.
Removed files are between the last common checkpoint and diverged WAL
location, and therefore are required for Postgres to do successful recovery.
Since these files never made it to the archive and are also absent on the
new primary the old primary can't start as a replica.
And I will emphasize one more time, that these files were removed by
pg_rewind despite the known fact that they are required to perform a
recovery.

>
> > If I am missing something however, please correct me or explain why it is
> > not possible to implement this straightforward solution.
>
> Maybe you're mistaking the operation.

We are not (Patroni author is here).

> If I understand the situation
> correctly, I think the following steps replays your "issue" and then
> resolve that.
>
>
> # killall -9 postgres
> # rm -r oldprim newprim oldarch newarch oldprim.log newprim.log
> mkdir newarch oldarch
> initdb -k -D oldprim
> echo "archive_mode = 'always'">> oldprim/postgresql.conf
>

With archive_mode = always you can't reproduce it.
It is very rarely people set it to always in production due to the overhead.

> echo "archive_command = 'cp %p `pwd`/oldarch/%f'">> oldprim/postgresql.conf
> pg_ctl -D oldprim -o '-p 5432' -l oldprim.log start
> psql -p 5432 -c 'create table t(a int)'
> pg_basebackup -D newprim -p 5432
> echo "primary_conninfo='host=/tmp port=5432'">> oldprim/postgresql.conf
> echo "archive_command = 'cp %p `pwd`/newarch/%f'">> newprim/postgresql.conf
> touch newprim/standby.signal
> pg_ctl -D newprim -o '-p 5433' -l newprim.log start
> pg_ctl -D newprim promote
> for i in $(seq 1 4); do psql -p 5432 -c 'insert into t values(0); select
> pg_switch_wal();'; done
> psql -p 5432 -c 'checkpoint'
> pg_ctl -D oldprim stop
>

The archive_mode has to be set to on and the archive_command should be
failing when you do pg_ctl -D oldprim stop

> echo "restore_command = 'cp `pwd`/oldarch/%f %p'">> oldprim/postgresql.conf
> # pg_rewind -D oldprim --source-server='port=5433' # fails
> pg_rewind -D oldprim --source-server='port=5433' -c
> for i in $(seq 1 4); do psql -p 5433 -c 'insert into t values(0); select
> pg_switch_wal();'; done
> psql -p 5433 -c 'checkpoint'
> echo "primary_conninfo='host=/tmp port=5433'">> oldprim/postgresql.conf
> touch oldprim/standby.signal
>
> postgres -D oldprim
>
> > FATAL: could not receive data from WAL stream: ERROR: requested WAL
> segment 000000020000000000000003 has already been removed
>
>
Regards,
--
Alexander Kukushkin

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message hubert depesz lubaczewski 2022-08-25 12:44:11 pg_restore deadlocks with itself
Previous Message Kyotaro Horiguchi 2022-08-25 07:49:17 Re: pg_rewind WAL segments deletion pitfall

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2022-08-25 08:38:41 Re: Avoid unecessary MemSet call (src/backend/utils/cache/relcache.c)
Previous Message Kyotaro Horiguchi 2022-08-25 08:25:49 Re: add checkpoint stats of snapshot and mapping files of pg_logical dir