Pg14 replication issue , recovery stucks in a random file without advancing while streaming from primary

From: Jorge Daniel <elgaita(at)hotmail(dot)com>
To: "pgsql-admin(at)lists(dot)postgresql(dot)org" <pgsql-admin(at)lists(dot)postgresql(dot)org>
Subject: Pg14 replication issue , recovery stucks in a random file without advancing while streaming from primary
Date: 2026-06-04 21:13:45
Message-ID: A05772F4-7C72-4151-9562-7C1A84A886E2@hotmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-admin

Good day to everyone

We're asking the PG-comunity for some help if it is possible.

We have a primary with 2 secondaries: the primary went down and one of the secondaries was promoted. The orphaned Secondary reconnected to the new primary and is replicating ok.
We had to reconstruct a new secondary, we did it as we always do with the basic and dependable:

pg_basebackup -h uspgvento14r.us.local -U replicator -p 5432 -D $PGDATA -Fp -Xs -P -R --checkpoint=fast --create-slot --slot=us_vento_replica_slot_aux

It ran for 5hrs without problem. When it finished:
$ pg_ctl start

The recovery was running until a consistent point was reached, the database opened and started streaming the rest from the Primary.
After an hour or so, the recovery got stuck in a certain wal file. No more log entries about it (debug 1) and after some hours the stream connection got disconnected.
The secondary is hung on that wal file and not going forward with the rest of the wal file list.

We re-tried this several times, changing the storage (just in case), with a new box with the same original Ubuntu 22.04 instead of 24.04 (just in case), but the result was the same.
Even though we have the 22.04 and 24.04 in parallel, we saw both replica engines freeze on the same file (everytime we re-created the stuck-wal-file changed, clearly).
We're out of ideas of what's happening.
Could you please shed some light here?

Primary: uspgvento14r

Version
server_version | 14.11 (Ubuntu 14.11-0ubuntu0.22.04.1)

pg_stat_replication:

pid | 4177197
usesysid | 16474
usename | replicator
application_name | 14_replica
client_addr | 192.168.11.33
client_hostname |
client_port | 53534
backend_start | 2026-06-04 01:25:18.100137-07
backend_xmin |
state | streaming
sent_lsn | 8EBA/CDA68930
write_lsn | 8EBA/CDA68930
flush_lsn | 8EBA/CDA68930
replay_lsn | 8EB7/968FEBE8
write_lag | 00:00:00.001543
flush_lag | 00:00:00.00225
replay_lag | 01:00:22.530041
sync_priority | 0
sync_state | async
reply_time | 2026-06-04 03:33:28.037703-07

Log Primary

2026-06-04 01:25:18 PDT [unknown] [unknown] 192.168.12.34 [unknown] [4177197] LOG: connection received: host=192.168.12.34 port=53534
2026-06-04 01:25:18 PDT [2858] LOG: background worker "logical replication worker" (PID 4177182) exited with exit code 1
2026-06-04 01:25:18 PDT [unknown] replicator 192.168.12.34 [unknown] [4177197] LOG: connection authenticated: identity="replicator" method=md5 (/etc/postgresql/14/main/pg_hba.conf:95)
2026-06-04 01:25:18 PDT [unknown] replicator 192.168.12.34 [unknown] [4177197] LOG: replication connection authorized: user=replicator application_name=14_uspgvento14Rb SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256)
--
2026-06-04 04:02:26 PDT [unknown] replicator 192.168.12.34 14_uspgvento14Rb [4177197] LOG: disconnection: session time: 2:37:08.575 user=replicator database= host=192.168.12.34 port=53534

Secondary : 14_uspgvento14Rb 192.168.12.34

Version:
server_version | 14.11 (Ubuntu 14.11-0ubuntu0.24.04.1)

Ubuntu 14.23-1.pgdg24.04+1

.....
2026-06-04 08:25:14 UTC [197740] DEBUG: got WAL segment from archive
2026-06-04 08:25:14 UTC [197740] LOG: restored log file "0000000200008EA40000002D" from archive
2026-06-04 08:25:14 UTC [197740] DEBUG: got WAL segment from archive
2026-06-04 08:25:14 UTC [197740] LOG: restored log file "0000000200008EA40000002E" from archive
2026-06-04 08:25:14 UTC [197740] DEBUG: got WAL segment from archive
2026-06-04 08:25:14 UTC [197740] LOG: restored log file "0000000200008EA40000002F" from archive
2026-06-04 08:25:14 UTC [197740] DEBUG: got WAL segment from archive
2026-06-04 08:25:14 UTC [197740] LOG: restored log file "0000000200008EA400000030" from archive
2026-06-04 08:25:14 UTC [197740] DEBUG: got WAL segment from archive
026-06-04 08:25:15 UTC [197747] DEBUG: checkpoint sync: number=4 file=base/6176124/2840_fsm time=0.011 ms
2026-06-04 08:25:15 UTC [197747] DEBUG: checkpoint sync: number=5 file=base/6176124/6599312.33 time=1.493 ms
......
2026-06-04 08:25:18 UTC [197740] LOG: restored log file "0000000200008EA400000044" from archive
2026-06-04 08:25:18 UTC [197740] DEBUG: got WAL segment from archive
2026-06-04 08:25:18 UTC [197740] DEBUG: end of backup reached
2026-06-04 08:25:18 UTC [197740] CONTEXT: WAL redo at 8EA4/44680C08 for XLOG/BACKUP_END: 8E91/8B065578
2026-06-04 08:25:18 UTC [197740] LOG: consistent recovery state reached at 8EA4/44680C30
2026-06-04 08:25:18 UTC [197738] LOG: database system is ready to accept read-only connections
cp: cannot stat '/pg_data/pg14_wal_archive/0000000200008EA400000045': No such file or directory
2026-06-04 08:25:18 UTC [207957] LOG: started streaming WAL from primary at 8EA4/45000000 on timeline 2
2026-06-04 08:25:39 UTC [unknown] [unknown] [local] [unknown] [208039] LOG: connection received: host=[local]
2026-06-04 08:25:39 UTC postgres postgres [local] [unknown] [208039] LOG: connection authorized: user=postgres database=postgres application_name=psql
2026-06-04 08:26:15 UTC [197747] LOG: restartpoint starting: time
2026-06-04 08:26:15 UTC [197747] DEBUG: performing replication slot checkpoint
......
22026-06-04 08:30:15 UTC [197747] DEBUG: checkpoint sync: number=4 file=base/6176124/2840_fsm time=0.006 ms
2026-06-04 08:30:15 UTC [197747] DEBUG: checkpoint sync: number=5 file=base/6176124/6599312.33 time=1.572 ms
2026-06-04 08:30:15 UTC [197747] DEBUG: checkpoint sync: number=6 file=base/6176124/6602460.56 time=1.791 ms
.....
2026-06-04 09:35:15 UTC [197747] DEBUG: checkpoint sync: number=782 file=base/6176124/6601067.15 time=0.003 ms
2026-06-04 09:35:15 UTC [197747] DEBUG: checkpoint sync: number=783 file=base/6176124/6374431.10 time=0.006 ms
2026-06-04 09:35:15 UTC [197747] LOG: restartpoint complete: wrote 99500 buffers (19.0%); 0 WAL file(s) added, 0 removed, 25 recycled; write=239.934 s, sync=0.039 s, total=239.989 s; sync files=783, longest=0.001 s, average=0.001 s; distance=250897 kB, estimate=13117629 kB
2026-06-04 09:35:15 UTC [197747] LOG: recovery restart point at 8EB7/7550F228
2026-06-04 09:35:15 UTC [197747] DETAIL: Last completed transaction was at log time 2026-06-04 02:33:05.505911-07.
^(at)^@^(at)^@^(at)^@^(at)^@——> Forever<———

postgres=# select pg_last_wal_replay_lsn();
pg_last_wal_replay_lsn
------------------------
8EB7/968FEBE8

postgres(at)uspgvento14Rb:/pg_data/data14/pg_wal$ ps -ef |grep wal
postgres 207957 197738 6 08:25 ? 00:04:52 postgres: 14_uspgvento14Rb: walreceiver streaming 8EB7/A39875E8
postgres 211309 163479 0 09:36 pts/12 00:00:00 grep wal
postgres(at)uspgvento14Rb:/pg_data/data14/pg_wal$ ps -ef |grep reco
postgres 197740 197738 16 08:11 ? 00:14:26 postgres: 14_uspgvento14Rb: startup recovering 0000000200008EB700000096
postgres 211311 163479 0 09:36 pts/12 00:00:00 grep reco

Content of the wal :

postgres(at)uspgvento14b:/pg_data/data14/pg_wal$ pg_waldump 0000000200008EB700000096 |grep 8FEBE8
rmgr: MultiXact len (rec/tot): 54/ 54, tx: 2681967401, lsn: 8EB7/968FEBE8, prev 8EB7/968FEBA8, desc: CREATE_ID 2439167 offset 5217620 nmembers 2: 2681967400 (keysh) 2681967401 (keysh)
rmgr: Heap len (rec/tot): 54/ 54, tx: 2681967401, lsn: 8EB7/968FEC20, prev 8EB7/968FEBE8, desc: LOCK off 35: xid 2439167: flags 0x00 IS_MULTI LOCK_ONLY KEYSHR_LOCK , blkref #0: rel 1663/6176124/6474188 blk 29758258

There are still plenty of files to process:
postgres(at)uspgvento14Rb:/pg_data/data14/pg_wal$ ls -ltr |grep -A5 -B5 0000000200008EB700000096
-rw------- 1 postgres postgres 16777216 Jun 4 09:32 0000000200008EB700000091
-rw------- 1 postgres postgres 16777216 Jun 4 09:32 0000000200008EB700000092
-rw------- 1 postgres postgres 16777216 Jun 4 09:32 0000000200008EB700000093
-rw------- 1 postgres postgres 16777216 Jun 4 09:32 0000000200008EB700000094
-rw------- 1 postgres postgres 16777216 Jun 4 09:32 0000000200008EB700000095
-rw------- 1 postgres postgres 16777216 Jun 4 09:33 0000000200008EB700000096
-rw------- 1 postgres postgres 16777216 Jun 4 09:33 0000000200008EB700000097
-rw------- 1 postgres postgres 16777216 Jun 4 09:33 0000000200008EB700000098
-rw------- 1 postgres postgres 16777216 Jun 4 09:33 0000000200008EB700000099
-rw------- 1 postgres postgres 16777216 Jun 4 09:34 0000000200008EB70000009A
-rw------- 1 postgres postgres 16777216 Jun 4 09:34 0000000200008EB70000009B

Postgresql.conf Secondary (primary is the same, but with more memory)

All non-default values:

shared_buffers = 4GB # min 128kB
temp_buffers = 512MB # min 800kB
work_mem = 512MB # min 64kB
maintenance_work_mem = 4GB # min 1MB
autovacuum_work_mem = 512MB # min 1MB, or -1 to use maintenance_work_mem
max_stack_depth = 7MB
bgwriter_lru_maxpages = 1000 # max buffers written/round, 0 disables

wal_level = logical # minimal, replica, or logical
wal_log_hints = on # also do full page writes of non-critical updates

checkpoint_completion_target = 0.8 # checkpoint target duration, 0.0 - 1.0
checkpoint_warning = 600s # 0 disables
max_wal_size = 500GB
min_wal_size = 50GB

max_wal_senders = 10 # max number of walsender processes
max_replication_slots = 30 # max number of replication slots
wal_keep_size = 150GB # in megabytes; 0 disables 30720=30GB 1.5 day
max_slot_wal_keep_size = 500GB # in megabytes; -1 disables 307200=300GB 4 days
wal_sender_timeout = 600s # in milliseconds; 0 disables

max_standby_archive_delay = 48h # max delay before canceling queries
max_standby_streaming_delay = 48h # max delay before canceling queries
hot_standby_feedback = on # send info from standby to prevent
wal_receiver_timeout = 600s # time that receiver waits for

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message hello from Burnside Project 2026-06-04 23:51:25 Re:Pg14 replication issue , recovery stucks in a random file without advancing while streaming from primary
Previous Message Tom Lane 2026-06-04 20:15:11 Re: Howto tell pg_ctl to use a non-default directory for config files