Re: BUG: Former primary node might stuck when started as a standby

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Aleksander Alekseev <aleksander(at)timescale(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: BUG: Former primary node might stuck when started as a standby
Date: 2026-02-05 05:00:01
Message-ID: 180bf13e-235d-46a9-9788-7d6cee40bcdc@gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello,

22.01.2024 13:00, Aleksander Alekseev wrote:
>> Perhaps that's true generally, but as we can see, without the extra
>> records replayed, this scenario works just fine. Moreover, existing tests
>> rely on it, e.g., 009_twophase.pl or 012_subtransactions.pl (in fact, my
>> research of the issue was initiated per a test failure).
> I suggest focusing on particular flaky tests then and how to fix them.

For the record: skink produced the same failure with 009_twophase.pl
yesterday [1]:

366/366 postgresql:recovery / recovery/009_twophase TIMEOUT        15000.31s   killed by signal 15 SIGTERM

[15:11:30.591](0.807s) ok 11 - Restore of prepared transaction on promoted standby
### Enabling streaming replication for node "london"
### Starting node "london"
# Running: pg_ctl --wait --pgdata
/home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/recovery/009_twophase/data/t_009_twophase_london_data/pgdata
--log /home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/recovery/009_twophase/log/009_twophase_london.log
--options --cluster-name=london start
waiting for server to start....... done
server started
# Postmaster PID for node "london" is 4181556
death by signal at /home/bf/bf-build/skink-master/HEAD/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm line 181.
# Postmaster PID for node "london" is 4181556
### Stopping node "london" using mode immediate
# Running: pg_ctl --pgdata
/home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/recovery/009_twophase/data/t_009_twophase_london_data/pgdata
--mode immediate stop
waiting for server to shut down.... done
server stopped
# No postmaster PID for node "london"
# Postmaster PID for node "paris" is 4147391
### Stopping node "paris" using mode immediate
# Running: pg_ctl --pgdata
/home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/recovery/009_twophase/data/t_009_twophase_paris_data/pgdata
--mode immediate stop
waiting for server to shut down.... done
server stopped
# No postmaster PID for node "paris"
[19:19:46.999](14896.408s) # Tests were run but no plan was declared and done_testing() was not seen.
[19:19:46.999](0.000s) # Looks like your test exited with 4 just after 11.

pgsql.build/testrun/recovery/009_twophase/log/009_twophase_london.log
2026-02-04 15:11:35.675 CET [4181556][postmaster][:0] LOG:  starting PostgreSQL 19devel on x86_64-linux, compiled by
gcc-15.2.0, 64-bit
2026-02-04 15:11:35.692 CET [4181556][postmaster][:0] LOG: listening on Unix socket "/tmp/IzuUID03TM/.s.PGSQL.22911"
2026-02-04 15:11:35.904 CET [4183900][startup][:0] LOG:  database system was interrupted; last known up at 2026-02-04
15:11:14 CET
2026-02-04 15:11:36.017 CET [4183900][startup][:0] LOG:  entering standby mode
2026-02-04 15:11:36.019 CET [4183900][startup][:0] LOG:  database system was not properly shut down; automatic recovery
in progress
2026-02-04 15:11:36.104 CET [4183900][startup][38/0:0] LOG:  redo starts at 0/03021600
2026-02-04 15:11:36.206 CET [4183900][startup][38/0:0] LOG:  invalid record length at 0/030220B8: expected at least 24,
got 0
2026-02-04 15:11:36.209 CET [4183900][startup][38/0:0] LOG: consistent recovery state reached at 0/030220B8
2026-02-04 15:11:36.210 CET [4181556][postmaster][:0] LOG:  database system is ready to accept read-only connections
2026-02-04 15:11:36.919 CET [4184142][walreceiver][:0] LOG: fetching timeline history file for timeline 2 from primary
server
2026-02-04 15:11:36.997 CET [4184142][walreceiver][:0] LOG:  started streaming WAL from primary at 0/03000000 on timeline 1
2026-02-04 15:11:37.076 CET [4184142][walreceiver][:0] LOG: replication terminated by primary server
2026-02-04 15:11:37.076 CET [4184142][walreceiver][:0] DETAIL:  End of WAL reached on timeline 1 at 0/03022070.
2026-02-04 15:11:37.098 CET [4183900][startup][38/0:0] LOG:  new timeline 2 forked off current database system timeline
1 before current recovery point 0/030220B8
2026-02-04 15:11:37.103 CET [4184142][walreceiver][:0] LOG: restarted WAL streaming at 0/03000000 on timeline 1
2026-02-04 15:11:37.107 CET [4184142][walreceiver][:0] LOG: replication terminated by primary server
...

[1] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=skink&dt=2026-02-04%2013%3A36%3A40&stg=recovery-check

Best regards,
Alexander

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Chao Li 2026-02-05 05:02:08 Re: pg_upgrade: fix memory leak in SLRU I/O code
Previous Message Corey Huinker 2026-02-05 04:42:16 Re: Add expressions to pg_restore_extended_stats()