| From: | Mateusz Kosek <mateusz(dot)kosek(at)gmail(dot)com> |
|---|---|
| To: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
| Subject: | BUG??/FEATURE REQUEST: Streaming replication walreceiver doesn't restart after standby reboot with recovery_min_apply_delay |
| Date: | 2026-04-03 21:35:42 |
| Message-ID: | CAFyWK5f6r51vGL-YroYQczcwY4zTryCoW_NdAVn=LMSMmCEbrw@mail.gmail.com |
| Lists: | pgsql-hackers |
Hi,
I'm setting up a standby that streams WAL from the primary in real time
but replays it with a 7-day delay, using
recovery_min_apply_delay = '7d'.
The goal is to handle two scenarios:
1. Primary failure: promote the standby (which was streaming continuously)
with minimal recent transaction loss.
2. Accidental data loss (e.g., DROP TABLE, bad DELETE/UPDATE): recover from
the delayed replay within 7 days.
Standby config:
OS: Debian 13.4
PG: 18.1 (built from source)
Base backup:
pg_basebackup -h 10.88.0.12 -p 5432 -D /var/lib/postgresql/18.1/test \
    -U test_replication --create-slot --slot=test_slot --checkpoint=fast \
    --progress --wal-method=stream --write-recovery-conf
Added to postgresql.conf:
hot_standby = on
recovery_min_apply_delay = '7d'
Confirmed postgresql.auto.conf has:
primary_conninfo
primary_slot_name
Initial behavior works fine:
walreceiver starts:
ps awux | grep wal
postgres 988251 0.1 0.0 285464 8952 ? Ss 00:00 2:31 postgres: walreceiver streaming 3CC/EFC17820
replay_delay grows as expected:
postgres=# SELECT
pg_last_wal_receive_lsn() AS received_lsn,
pg_last_wal_replay_lsn() AS replayed_lsn,
pg_last_xact_replay_timestamp() AS replay_timestamp,
now() - pg_last_xact_replay_timestamp() AS replay_delay;
 received_lsn | replayed_lsn |           replay_timestamp           |  replay_delay
--------------+--------------+--------------------------------------+-----------------
 3CC/EFC28A58 | 3C7/BF8A4CD8 | Fri 03 Apr 00:00:07.278303 2026 CEST | 22:19:24.777788
Standby pg_wal is growing (expected due to delay):
du -h 18.1/test/pg_wal/
4.0K 18.1/test/pg_wal/summaries
988K 18.1/test/pg_wal/archive_status
21G 18.1/test/pg_wal/
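As a rough cross-check, the gap between received_lsn and replayed_lsn in the query output above can be converted to bytes and compared with the pg_wal size reported by du. The sketch below relies only on the pg_lsn text format (two hexadecimal numbers, the high and low 32 bits of the WAL position); the helper names lsn_to_int and lsn_gap_bytes are made up for illustration.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a pg_lsn string such as '3CC/EFC28A58' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lsn_gap_bytes(received: str, replayed: str) -> int:
    """Bytes of WAL received but not yet replayed."""
    return lsn_to_int(received) - lsn_to_int(replayed)


# Values taken from the query output above.
gap = lsn_gap_bytes("3CC/EFC28A58", "3C7/BF8A4CD8")
print(f"{gap} bytes, about {gap / 2**30:.2f} GiB")
```

For these two LSNs the gap comes to a little under 21 GiB, which is consistent with the 21G that du shows for the standby's pg_wal.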
Primary stays clean:
du -sch 18.1/main/pg_wal/
33M 18.1/main/pg_wal/
33M total
Problem after standby reboot (e.g., after Debian security updates):
Server starts, but walreceiver does NOT restart.
From the state machine comment in xlogrecovery.c:
* Standby mode is implemented by a state machine:
*
* 1. Read from either archive or pg_wal (XLOG_FROM_ARCHIVE), or just
* pg_wal (XLOG_FROM_PG_WAL)
* 2. Check for promotion trigger request
* 3. Read from primary server via walreceiver (XLOG_FROM_STREAM)
* 4. Rescan timelines
* 5. Sleep wal_retrieve_retry_interval milliseconds, and loop back to 1.
*
* Failure to read from the current source advances the state machine to
* the next state.
As I understand the code in xlogrecovery.c, after a restart the state
machine first reads local pg_wal (XLOG_FROM_PG_WAL) and applies the delay.
Because reads from pg_wal keep succeeding while the backlog lasts, it never
reaches the "Read from primary via walreceiver (XLOG_FROM_STREAM)" step.
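To make that reading concrete, here is a toy model of it in Python. It is not PostgreSQL code and does not claim to match xlogrecovery.c in detail: run_standby, its parameters, and the one-tick-per-hour time scale are all made up for illustration. The only rule it encodes is the one from the comment above, that the source advances to streaming only when a read from local pg_wal fails.

```python
from collections import deque


def run_standby(local_backlog, apply_delay, ticks):
    """Toy model: return the tick at which streaming would start, or None.

    local_backlog: commit times (in ticks, relative to the restart at tick 0)
                   of records already sitting in the standby's pg_wal.
    apply_delay:   ticks a record must age before it may be replayed.
    ticks:         how many ticks to simulate.
    """
    pending = deque(sorted(local_backlog))
    for now in range(ticks):
        # Replay every local record that is old enough.
        while pending and now - pending[0] >= apply_delay:
            pending.popleft()
        if not pending:
            # Reading from pg_wal "fails": only now does the state machine
            # advance to the streaming source.
            return now
        # Otherwise recovery waits on the delayed record; in this model it
        # stays on the local source and never asks for a walreceiver.
    return None


# Three records committed 3, 2 and 1 hours before the restart, 7-day delay.
print(run_standby([-3, -2, -1], apply_delay=168, ticks=400))
# Same backlog without a delay.
print(run_standby([-3, -2, -1], apply_delay=0, ticks=10))
```

In this model streaming starts immediately when there is no delay, but with a 168-hour delay it only starts once the newest local record has aged past the delay, which is the multi-day gap described in this report.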
Consequences:
Standby stops being a current replica — no new WAL fetched for 7 days until
backlog clears.
Primary pg_wal bloats due to inactive replication slot.
If primary hits WAL retention limit first, slot becomes unusable, requiring
full rebuild.
According to the documentation, it doesn't seem like this is the intended
behavior:
https://www.postgresql.org/docs/current/runtime-config-replication.html#GUC-RECOVERY-MIN-APPLY-DELAY
It's also dangerously subtle: initial setup works perfectly, but after
weeks/months and a reboot, replication breaks silently.
Proposed solutions:
Always start walreceiver in background immediately after restart when
hot_standby=on, recovery_min_apply_delay is set,
primary_conninfo/primary_slot_name are present, and no promote signal
exists.
This would restore the exact same behavior as the initial startup:
walreceiver resumes streaming fresh WAL from primary while recovery process
independently replays the existing local pg_wal backlog with the configured
delay (which worked flawlessly for me for 3 weeks before the reboot).
Fallback:
pg_ctl option to start walreceiver on demand (like pg_ctl logrotate cli
option?).
postmaster.c already has CheckPostmasterSignal(PMSIGNAL_START_WALRECEIVER)
— just expose as CLI param.
Thoughts? Is this WAI or a bug?
Happy to test patches.
Thanks and best regards,