| From: | Alexander Lakhin <exclusion(at)gmail(dot)com> |
|---|---|
| To: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | t/035_standby_logical_decoding.pl might fail on attempt to read wrong timeline |
| Date: | 2026-06-06 09:00:00 |
| Message-ID: | 7daef094-abf3-4672-bc23-3df4763b16a3@gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hello hackers,
I'd like to present a recent failure of 035_standby_logical_decoding on
buildfarm [1]:
### Promoting node "standby"
# Running: pg_ctl --pgdata
/home/bf/bf-build/scorpion/HEAD/pgsql.build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata
--log
/home/bf/bf-build/scorpion/HEAD/pgsql.build/testrun/recovery/035_standby_logical_decoding/log/035_standby_logical_decoding_standby.log
promote
waiting for server to promote.... done
server promoted
Waiting for replication conn cascading_standby's replay_lsn to pass 0/0648A290 on standby
done
[10:40:16.996](0.446s) ok 84 - got expected output from SQL decoding session on promoted standby
[10:40:16.997](0.001s) # pump_until: process terminated unexpectedly when searching for "(?^s:^.*COMMIT.*COMMIT$)" with
stream: ""
[10:40:16.997](0.000s) not ok 85 - got 2 COMMIT from pg_recvlogical output
[10:40:16.997](0.000s) # Failed test 'got 2 COMMIT from pg_recvlogical output'
# at /home/bf/bf-build/scorpion/HEAD/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 1036.
[10:40:16.997](0.000s) not ok 86 - got same expected output from pg_recvlogical decoding session
[10:40:16.997](0.000s) # Failed test 'got same expected output from pg_recvlogical decoding session'
# at /home/bf/bf-build/scorpion/HEAD/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 1040.
[10:40:16.998](0.000s) # got: ''
# expected: 'BEGIN
# table public.decoding_test: INSERT: x[integer]:1 y[text]:'1'
# table public.decoding_test: INSERT: x[integer]:2 y[text]:'2'
# table public.decoding_test: INSERT: x[integer]:3 y[text]:'3'
# table public.decoding_test: INSERT: x[integer]:4 y[text]:'4'
# COMMIT
# BEGIN
# table public.decoding_test: INSERT: x[integer]:5 y[text]:'5'
# table public.decoding_test: INSERT: x[integer]:6 y[text]:'6'
# table public.decoding_test: INSERT: x[integer]:7 y[text]:'7'
# COMMIT'
[10:40:17.009](0.011s) ok 87 - got expected output from SQL decoding session on cascading standby
[10:40:17.009](0.000s) ok 88 - got 2 COMMIT from pg_recvlogical output
035_standby_logical_decoding_standby.log contains:
2026-06-03 10:40:16.745 CEST [3955206][startup][29/0:0] LOG: received promote request
...
2026-06-03 10:40:16.784 CEST [3955206][startup][29/0:0] LOG: archive recovery complete
2026-06-03 10:40:16.862 CEST [3963805][walsender][25/0:0] LOG: starting logical decoding for slot "promotion_activeslot"
2026-06-03 10:40:16.862 CEST [3963805][walsender][25/0:0] DETAIL: Streaming transactions committing after 0/06487A20,
reading WAL from 0/064879E8.
2026-06-03 10:40:16.862 CEST [3963805][walsender][25/0:0] STATEMENT: START_REPLICATION SLOT "promotion_activeslot"
LOGICAL 0/00000000 ("include-xids" '0', "skip-empty-xacts" '1')
2026-06-03 10:40:16.862 CEST [3963805][walsender][25/0:0] ERROR: requested WAL segment 000000010000000000000006 has
already been removed
2026-06-03 10:40:16.862 CEST [3963805][walsender][25/0:0] STATEMENT: START_REPLICATION SLOT "promotion_activeslot"
LOGICAL 0/00000000 ("include-xids" '0', "skip-empty-xacts" '1')
2026-06-03 10:40:16.862 CEST [3963805][walsender][25/0:0] LOG: released logical replication slot "promotion_activeslot"
2026-06-03 10:40:16.862 CEST [3963805][walsender][:0] LOG: disconnection: session time: 0:00:00.212 user=bf
database=testdb host=[local]
...
That is, walsender requested WAL segment for timeline 1, while in a
successful run, it reads WAL for timeline 2.
I've managed to reproduce this failure with:
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1502,2 +1502,3 @@ StartLogicalReplication(StartReplicationCmd *cmd)
ReplicationSlotAcquire(cmd->slotname, true, true);
+pg_usleep(200000);
Being executed in a loop, the test fails for me within 100 iterations
(within 10 on a slower VM):
ITERATION 23
# +++ tap check in src/test/recovery +++
t/035_standby_logical_decoding.pl .. 84/? # pump_until: process terminated unexpectedly when searching for
"(?^s:^.*COMMIT.*COMMIT$)" with stream: ""
# Failed test 'got 2 COMMIT from pg_recvlogical output'
# at t/035_standby_logical_decoding.pl line 1038.
# Failed test 'got same expected output from pg_recvlogical decoding session'
# at t/035_standby_logical_decoding.pl line 1042.
# got: ''
# expected: 'BEGIN
# table public.decoding_test: INSERT: x[integer]:1 y[text]:'1'
...
As far as I can see, the timeline is chosen in logical_read_xlog_page()
depending on the recovery state:
am_cascading_walsender = RecoveryInProgress();
if (am_cascading_walsender)
GetXLogReplayRecPtr(&currTLI);
else
currTLI = GetWALInsertionTimeLine();
With diagnostic logging added here, a failed run shows:
2026-06-06 07:23:01.558 UTC walsender[165403] 035_standby_logical_decoding.pl LOG: starting logical decoding for slot
"promotion_activeslot"
2026-06-06 07:23:01.558 UTC walsender[165403] 035_standby_logical_decoding.pl DETAIL: Streaming transactions committing
after 0/0647C7E8, reading WAL from 0/0647C7B0.
2026-06-06 07:23:01.558 UTC walsender[165403] 035_standby_logical_decoding.pl STATEMENT: START_REPLICATION SLOT
"promotion_activeslot" LOGICAL 0/00000000 ("include-xids" '0', "skip-empty-xacts" '1')
2026-06-06 07:23:01.559 UTC walsender[165403] 035_standby_logical_decoding.pl LOG: !!!logical_read_xlog_page|
am_cascading_walsender: 1, currTLI: 1
2026-06-06 07:23:01.559 UTC walsender[165403] 035_standby_logical_decoding.pl STATEMENT: START_REPLICATION SLOT
"promotion_activeslot" LOGICAL 0/00000000 ("include-xids" '0', "skip-empty-xacts" '1')
2026-06-06 07:23:01.559 UTC walsender[165403] 035_standby_logical_decoding.pl LOG: !!!WalSndSegmentOpen| nextSegNo: 6,
path: pg_wal/000000010000000000000006
2026-06-06 07:23:01.559 UTC walsender[165403] 035_standby_logical_decoding.pl STATEMENT: START_REPLICATION SLOT
"promotion_activeslot" LOGICAL 0/00000000 ("include-xids" '0', "skip-empty-xacts" '1')
2026-06-06 07:23:01.559 UTC walsender[165403] 035_standby_logical_decoding.pl ERROR: requested WAL segment
000000010000000000000006 has already been removed
In a successful run, am_cascading_walsender/RecoveryInProgress() == false and currTLI == 2:
2026-06-06 07:28:38.498 UTC walsender[169751] 035_standby_logical_decoding.pl LOG: starting logical decoding for slot
"promotion_activeslot"
2026-06-06 07:28:38.498 UTC walsender[169751] 035_standby_logical_decoding.pl DETAIL: Streaming transactions committing
after 0/0647C7E8, reading WAL from 0/0647C7B0.
2026-06-06 07:28:38.498 UTC walsender[169751] 035_standby_logical_decoding.pl STATEMENT: START_REPLICATION SLOT
"promotion_activeslot" LOGICAL 0/00000000 ("include-xids" '0', "skip-empty-xacts" '1')
2026-06-06 07:28:38.498 UTC walsender[169751] 035_standby_logical_decoding.pl LOG: !!!logical_read_xlog_page|
am_cascading_walsender: 0, currTLI: 2
2026-06-06 07:28:38.498 UTC walsender[169751] 035_standby_logical_decoding.pl STATEMENT: START_REPLICATION SLOT
"promotion_activeslot" LOGICAL 0/00000000 ("include-xids" '0', "skip-empty-xacts" '1')
2026-06-06 07:28:38.498 UTC walsender[169751] 035_standby_logical_decoding.pl LOG: !!!WalSndSegmentOpen| nextSegNo: 6,
path: pg_wal/000000020000000000000006
Reproduced at the commit fcd77d532, which added the test.
[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=scorpion&dt=2026-06-03%2008%3A31%3A43
Best regards,
Alexander
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Amit Langote | 2026-06-06 09:13:15 | Re: PG19 FK fast path: OOB write and missed FK checks during batched |
| Previous Message | Nikolay Samokhvalov | 2026-06-06 08:30:51 | PG19 FK fast path: OOB write and missed FK checks during batched |