Re: Implement waiting for wal lsn replay: reloaded

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Xuneng Zhou <xunengzhou(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Álvaro Herrera <alvherre(at)kurilemu(dot)de>, Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, jian he <jian(dot)universality(at)gmail(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Subject: Re: Implement waiting for wal lsn replay: reloaded
Date: 2026-05-19 20:00:00
Message-ID: 63f6abc9-c0ae-465d-a4e6-667eca6ea008@gmail.com
Lists: pgsql-hackers

Hello Alexander and Xuneng,

06.04.2026 22:49, Alexander Korotkov wrote:
> Thank you, I've pushed your version of patchset. I made two minor
> corrections for patch #2: mention default mode value in the header
> comment, and fallback to polling on has_wal_read_bug sparc64+ext4 bug.

I've discovered a new test failure that is apparently caused by the new
wait_for_catchup() implementation [1]:
[06:20:23.110](1.069s) not ok 8 - check that the slot state changes to "extended"
[06:20:23.110](0.001s) #   Failed test 'check that the slot state changes to "extended"'
#   at /Users/ec2-user/bf/goldfish/HEAD/pgsql/src/test/recovery/t/019_replslot_limit.pl line 140.
[06:20:23.111](0.000s) #          got: 'unreserved'
#     expected: 'extended'
[06:20:23.231](0.120s) not ok 9 - check that the slot state changes to "unreserved"
[06:20:23.231](0.000s) #   Failed test 'check that the slot state changes to "unreserved"'
#   at /Users/ec2-user/bf/goldfish/HEAD/pgsql/src/test/recovery/t/019_replslot_limit.pl line 152.
[06:20:23.231](0.000s) #          got: 'lost|'
#     expected: 'unreserved|t'

I've managed to reproduce such failures with the following delays added to
the walreceiver and the walsender:
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 07eac07b9ce..493ce92674e 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1143,2 +1143,3 @@ XLogWalRcvSendReply(bool force, bool requestReply, bool checkApply)

+pg_usleep(10000);
     /* Get current timestamp. */
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 04aa770d981..19cda3a6b51 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2521,2 +2521,3 @@ ProcessStandbyReplyMessage(void)

+pg_usleep(100000);
     /* the caller already consumed the msgtype byte */

Concretely, a loop:
for i in {1..100}; do echo "ITERATION $i"; PROVE_TESTS="t/019*" make -s check -C src/test/recovery/ || break; done
failed for me on iterations 2, 1, and 7 of three separate runs:
ITERATION 7
# +++ tap check in src/test/recovery +++
t/019_replslot_limit.pl .. 8/?
#   Failed test 'check that the slot state changes to "extended"'
#   at t/019_replslot_limit.pl line 140.
#          got: 'unreserved'
#     expected: 'extended'
t/019_replslot_limit.pl .. 26/? # Looks like you failed 1 test of 26.
t/019_replslot_limit.pl .. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/26 subtests

With "WAIT FOR LSN" in wait_for_catchup() disabled, 100 iterations
passed.

With extra logging added, I could see the key difference.
Failed run:
2026-05-19 22:01:37.968 EEST client backend[3632148] 019_replslot_limit.pl LOG:  !!!GetWALAvailability| targetLSN: 0/016C0000, targetSeg: 22, oldestSlotSeg: 23, oldestSegMaxWalSize: 24, oldestSeg: 22
2026-05-19 22:01:37.968 EEST client backend[3632148] 019_replslot_limit.pl STATEMENT:  SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep1'
vs
Successful run:
2026-05-19 22:04:18.102 EEST client backend[3633761] 019_replslot_limit.pl LOG:  !!!GetWALAvailability| targetLSN: 0/01700000, targetSeg: 23, oldestSlotSeg: 23, oldestSegMaxWalSize: 24, oldestSeg: 23
2026-05-19 22:04:18.102 EEST client backend[3633761] 019_replslot_limit.pl STATEMENT:  SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep1'

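That is, with WAIT FOR LSN, the primary in this test may advance
slot->data.restart_lsn to the expected position only after
wait_for_catchup() has already returned, so the test can query wal_status
while the slot still points at the previous segment.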
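
To illustrate how the logged values lead to the two different statuses,
here is a minimal standalone sketch of the segment comparison. It is not
the server code: the enum and function names are hypothetical, the order
of comparisons is a simplified reading of GetWALAvailability() in xlog.c,
and the 1MB segment size is an assumption inferred from the logs
(targetLSN 0/016C0000 mapping to targetSeg 22).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names mirroring the wal_status strings, not the server's enum. */
typedef enum
{
    WAL_RESERVED,
    WAL_EXTENDED,
    WAL_UNRESERVED,
    WAL_LOST
} WalStatus;

/* Map an LSN to its WAL segment number for a given segment size. */
uint64_t
lsn_to_seg(uint64_t lsn, uint64_t seg_size)
{
    assert(seg_size > 0);
    return lsn / seg_size;
}

/*
 * Classify a slot's restart position, assuming the comparisons are made
 * in this order (simplified from my reading of GetWALAvailability()).
 */
WalStatus
classify_wal(uint64_t targetSeg, uint64_t oldestSlotSeg,
             uint64_t oldestSegMaxWalSize, uint64_t oldestSeg)
{
    if (targetSeg >= oldestSlotSeg)
    {
        /* still retained on behalf of slots */
        if (targetSeg >= oldestSegMaxWalSize)
            return WAL_RESERVED;    /* within max_wal_size */
        return WAL_EXTENDED;        /* retained beyond max_wal_size */
    }

    /* no longer retained, but the file has not been removed yet */
    if (targetSeg >= oldestSeg)
        return WAL_UNRESERVED;

    return WAL_LOST;
}

/* Render the status the way the test output shows it. */
const char *
wal_status_name(WalStatus status)
{
    switch (status)
    {
        case WAL_RESERVED:
            return "reserved";
        case WAL_EXTENDED:
            return "extended";
        case WAL_UNRESERVED:
            return "unreserved";
        case WAL_LOST:
            return "lost";
    }
    return "unknown";
}
```

With the values from the failed run, classify_wal(22, 23, 24, 22) falls
through the first branch because targetSeg is one segment behind
oldestSlotSeg. With the values from the successful run,
classify_wal(23, 23, 24, 23) takes the first branch, since restart_lsn has
already moved into segment 23.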

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=goldfish&dt=2026-05-13%2006%3A15%3A03

Best regards,
Alexander
