pgsql: Fix race condition in 028_pitr_timelines.pl test, add note to do

From: Heikki Linnakangas <heikki(dot)linnakangas(at)iki(dot)fi>
To: pgsql-committers(at)lists(dot)postgresql(dot)org
Subject: pgsql: Fix race condition in 028_pitr_timelines.pl test, add note to do
Date: 2022-02-15 23:42:04
Message-ID: E1nK7SO-0000nq-0G@gemulon.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-committers

Fix race condition in 028_pitr_timelines.pl test, add note to docs.

The 028_pitr_timelines.pl test would sometimes hang, waiting for a WAL
segment that was just filled up to be archived. It was because the
test used 'pg_stat_archiver.last_archived_wal' to check if a file was
archived, but the order that WAL files are archived when a standby is
promoted is not fully deterministic, and 'last_archived_wal' tracks
the last segment that was archived, not the highest-numbered WAL
segment. Because of that, if the archiver archived segment 3, and then
2, 'last_archived_wal' say 2, and the test query would think that 3
has not been archived yet.

Normally, WAL files are marked ready for archival in order, and the
archiver process will process them in order, so that issue doesn't
arise. We have used the same query on 'last_archived_wal' in a few
other tests with no problem. But when a standby is promoted, things
are a bit chaotic. After promotion, the server will try to archive all
the WAL segments from the old timeline that are in pg_wal, as well as
the history file and any new WAL segments on the new timeline. The
end-of-recovery checkpoint will create the .ready files for all the
WAL files on the old timeline, but at the same time, the new timeline
is opened up for business. A file from the new timeline can therefore
be archived before the files from the old timeline have been marked as
ready for archival.

It turns out that we don't really need to wait for the archival in
this particular test, because the standby server is about to be
stopped, and stopping a server will wait for the end-of-recovery
checkpoint and all WAL archivals to finish, anyway. So we can just
remove it from the test.

Add a note to the docs on 'pg_stat_archiver' view that files can be
archived out of order.

Reviewed-by: Tom Lane
Discussion: https://www.postgresql.org/message-id/3186114.1644960507@sss.pgh.pa.us

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/853c6400bf5defda139dc36387ce761398011799

Modified Files
--------------
doc/src/sgml/monitoring.sgml | 17 +++++++++++++----
src/test/recovery/t/028_pitr_timelines.pl | 13 ++-----------
2 files changed, 15 insertions(+), 15 deletions(-)

Browse pgsql-committers by date

  From Date Subject
Next Message Heikki Linnakangas 2022-02-15 23:42:27 Re: last_archived_wal is not necessary the latest WAL file (was Re: pgsql: Add test case for an archive recovery corner case.)
Previous Message Peter Geoghegan 2022-02-15 23:17:17 pgsql: Update "don't truncate with failsafe" rationale.