Re: pgsql: Prevent invalidation of newly synced replication slots.

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Amit Kapila <akapila(at)postgresql(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: pgsql: Prevent invalidation of newly synced replication slots.
Date: 2026-01-27 17:42:51
Message-ID: CA+TgmobkmUMBocv0hxKiBj_5ZoTRE=qPz7QOu7voJ_j=dWNaaA@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Tue, Jan 27, 2026 at 11:37 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > But I wonder if you can't rename("old", "new") where "new" is a file that
> > has already been unlinked (or renamed over) that someone still holds open,
> > or something like that...
>
> I don't see a source of that that would be specific to this test though :(. We
> do wait for pg_basebackup to have shut down, which wrote backup_label (which
> was "manufactured" during streaming by basebackup.c).
>
> Perhaps we should crank up log level in the test? No idea if it'll help, but
> right now I don't even know where to start looking.

I tried sticking a pg_sleep(30) in just before starting the standby
node, and that didn't help, so it doesn't seem like it's a race
condition.

Here's what the standby log file looks like with log_min_messages=DEBUG2:

2026-01-27 17:19:25.262 GMT postmaster[4932] DEBUG: registering
background worker "logical replication launcher"
2026-01-27 17:19:25.264 GMT postmaster[4932] DEBUG: dynamic shared
memory system will support 229 segments
2026-01-27 17:19:25.264 GMT postmaster[4932] DEBUG: created dynamic
shared memory control segment 3769552926 (9176 bytes)
2026-01-27 17:19:25.266 GMT postmaster[4932] DEBUG: max_safe_fds =
990, usable_fds = 1000, already_open = 3
2026-01-27 17:19:25.268 GMT postmaster[4932] LOG: starting PostgreSQL
19devel on x86_64-windows, compiled by msvc-19.29.30159, 64-bit
2026-01-27 17:19:25.271 GMT postmaster[4932] LOG: listening on Unix
socket "C:/Windows/TEMP/3xesO1s4ba/.s.PGSQL.17575"
2026-01-27 17:19:25.273 GMT postmaster[4932] DEBUG: updating PMState
from PM_INIT to PM_STARTUP
2026-01-27 17:19:25.273 GMT postmaster[4932] DEBUG: assigned pm child
slot 57 for io worker
2026-01-27 17:19:25.275 GMT postmaster[4932] DEBUG: assigned pm child
slot 58 for io worker
2026-01-27 17:19:25.277 GMT postmaster[4932] DEBUG: assigned pm child
slot 59 for io worker
2026-01-27 17:19:25.278 GMT postmaster[4932] DEBUG: assigned pm child
slot 56 for checkpointer
2026-01-27 17:19:25.280 GMT postmaster[4932] DEBUG: assigned pm child
slot 55 for background writer
2026-01-27 17:19:25.281 GMT postmaster[4932] DEBUG: assigned pm child
slot 89 for startup
2026-01-27 17:19:25.308 GMT checkpointer[6560] DEBUG: checkpointer
updated shared memory configuration values
2026-01-27 17:19:25.314 GMT startup[2488] LOG: database system was
interrupted; last known up at 2026-01-27 17:19:21 GMT
2026-01-27 17:19:25.317 GMT startup[2488] DEBUG: removing all
temporary WAL segments
The system cannot find the file specified.
2026-01-27 17:19:25.336 GMT startup[2488] DEBUG: could not restore
file "00000002.history" from archive: child process exited with exit
code 1
2026-01-27 17:19:25.337 GMT startup[2488] DEBUG: backup time
2026-01-27 17:19:21 GMT in file "backup_label"
2026-01-27 17:19:25.337 GMT startup[2488] DEBUG: backup label
pg_basebackup base backup in file "backup_label"
2026-01-27 17:19:25.337 GMT startup[2488] DEBUG: backup timeline 1 in
file "backup_label"
2026-01-27 17:19:25.337 GMT startup[2488] LOG: starting backup
recovery with redo LSN 0/2A000028, checkpoint LSN 0/2A000080, on
timeline ID 1
The system cannot find the file specified.
2026-01-27 17:19:25.352 GMT startup[2488] DEBUG: could not restore
file "00000001000000000000002A" from archive: child process exited
with exit code 1
2026-01-27 17:19:25.353 GMT startup[2488] DEBUG: checkpoint record is
at 0/2A000080
2026-01-27 17:19:25.353 GMT startup[2488] LOG: entering standby mode
2026-01-27 17:19:25.353 GMT startup[2488] DEBUG: redo record is at
0/2A000028; shutdown false
2026-01-27 17:19:25.353 GMT startup[2488] DEBUG: next transaction ID:
769; next OID: 24576
2026-01-27 17:19:25.353 GMT startup[2488] DEBUG: next MultiXactId: 1;
next MultiXactOffset: 1
2026-01-27 17:19:25.353 GMT startup[2488] DEBUG: oldest unfrozen
transaction ID: 760, in database 1
2026-01-27 17:19:25.353 GMT startup[2488] DEBUG: oldest MultiXactId:
1, in database 1
2026-01-27 17:19:25.353 GMT startup[2488] DEBUG: commit timestamp Xid
oldest/newest: 0/0
2026-01-27 17:19:25.353 GMT startup[2488] DEBUG: transaction ID wrap
limit is 2147484407, limited by database with OID 1
2026-01-27 17:19:25.353 GMT startup[2488] DEBUG: MultiXactId wrap
limit is 2147483648, limited by database with OID 1
2026-01-27 17:19:25.354 GMT startup[2488] DEBUG: starting up replication slots
2026-01-27 17:19:25.354 GMT startup[2488] DEBUG: xmin required by
slots: data 0, catalog 0
2026-01-27 17:19:25.354 GMT startup[2488] DEBUG: starting up
replication origin progress state
2026-01-27 17:19:25.354 GMT startup[2488] DEBUG: didn't need to
unlink permanent stats file "pg_stat/pgstat.stat" - didn't exist
2026-01-27 17:19:38.938 GMT startup[2488] FATAL: could not rename
file "backup_label" to "backup_label.old": Permission denied
2026-01-27 17:19:38.983 GMT postmaster[4932] DEBUG: releasing pm child slot 89
2026-01-27 17:19:38.983 GMT postmaster[4932] LOG: startup process
(PID 2488) exited with exit code 1
2026-01-27 17:19:38.983 GMT postmaster[4932] LOG: aborting startup
due to startup process failure
2026-01-27 17:19:38.983 GMT postmaster[4932] DEBUG: cleaning up
dynamic shared memory control segment with ID 3769552926
2026-01-27 17:19:38.985 GMT postmaster[4932] LOG: database system is shut down

Unfortunately, I don't see any clues there. The "The system cannot
find the file specified." messages look like they might be a clue, but
I don't think they are, because they also occur in
040_standby_failover_slots_sync_standby1.log, and that test passes. At
the point where the failing log shows the FATAL error, the passing log
continues thus:

2026-01-27 17:18:36.905 GMT startup[1420] DEBUG: resetting unlogged
relations: cleanup 1 init 0
2026-01-27 17:18:36.906 GMT startup[1420] DEBUG: initializing for hot standby
2026-01-27 17:18:36.906 GMT startup[1420] LOG: redo starts at 0/02000028
2026-01-27 17:18:36.906 GMT startup[1420] DEBUG: recovery snapshots
are now enabled
2026-01-27 17:18:36.906 GMT startup[1420] CONTEXT: WAL redo at
0/02000048 for Standby/RUNNING_XACTS: nextXid 769 latestCompletedXid
768 oldestRunningXid 769
2026-01-27 17:18:36.907 GMT startup[1420] DEBUG: end of backup record reached
2026-01-27 17:18:36.907 GMT startup[1420] CONTEXT: WAL redo at
0/02000100 for XLOG/BACKUP_END: 0/02000028
2026-01-27 17:18:36.907 GMT startup[1420] DEBUG: end of backup reached

Which again seems totally normal.
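
For what it's worth, here is a minimal C sketch of the failure mode I
have in mind (an assumption on my part, not something this test
demonstrates): on Windows, MoveFileEx() fails, typically with
ERROR_ACCESS_DENIED or ERROR_SHARING_VIOLATION, while some other handle
holds the source or the target open without FILE_SHARE_DELETE, and
ERROR_ACCESS_DENIED is what surfaces as "Permission denied". The file
names below are just stand-ins for backup_label / backup_label.old.

/*
 * Sketch only: show that a rename fails on Windows while another handle
 * holds the file open without FILE_SHARE_DELETE.
 */
#include <windows.h>
#include <stdio.h>

int
main(void)
{
	/* Hypothetical stand-ins for backup_label / backup_label.old. */
	const char *src = "backup_label";
	const char *dst = "backup_label.old";

	/*
	 * Simulate some other process keeping the file open for reading
	 * without passing FILE_SHARE_DELETE.
	 */
	HANDLE		h = CreateFileA(src, GENERIC_READ,
								FILE_SHARE_READ,	/* no FILE_SHARE_DELETE */
								NULL, OPEN_EXISTING,
								FILE_ATTRIBUTE_NORMAL, NULL);

	if (h == INVALID_HANDLE_VALUE)
	{
		fprintf(stderr, "CreateFileA failed: %lu\n", GetLastError());
		return 1;
	}

	/* The rename fails for as long as that handle stays open. */
	if (!MoveFileExA(src, dst, MOVEFILE_REPLACE_EXISTING))
		fprintf(stderr, "MoveFileExA failed: %lu\n", GetLastError());
	else
		printf("rename succeeded\n");

	CloseHandle(h);
	return 0;
}

If that is what's happening here, the open question is still what would
be holding backup_label (or backup_label.old) open only in this
particular test.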

--
Robert Haas
EDB: http://www.enterprisedb.com
