Re:Pg14 replication issue , recovery stucks in a random file without advancing while streaming from primary

From: hello from Burnside Project <hello(at)burnsideproject(dot)ai>
To: "Jorge Daniel" <elgaita(at)hotmail(dot)com>
Cc: "pgsql-admin(at)lists(dot)postgresql(dot)org" <pgsql-admin(at)lists(dot)postgresql(dot)org>
Subject: Re:Pg14 replication issue , recovery stucks in a random file without advancing while streaming from primary
Date: 2026-06-04 23:51:25
Message-ID: 19e950c8337.61bdc2a03840616.4821016053268419825@burnsideproject.ai
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-admin

Replication status

On the primary:

SELECT * FROM pg_stat_replication;

Shows:

* Is the standby connected?

* What LSN has been sent?

* What LSN has been replayed?

2. WAL receiver status

On the standby:

SELECT * FROM pg_stat_wal_receiver;

Shows:

* Is WAL still being received?

* From which server?

* Latest WAL location received?

3. Recovery progress

On the standby:

SELECT

    pg_last_wal_receive_lsn(),

    pg_last_wal_replay_lsn(),

    pg_last_xact_replay_timestamp();

   

This tells you whether:

* WAL is arriving but not replaying.

* WAL is replaying slowly.

* Replay has completely stopped.

4. PostgreSQL logs

Look for messages such as:

invalid record length

PANIC

could not read WAL

requested timeline

waiting for WAL

Best regards, 

Arif Rahman 

mailto:arif(dot)rahman(at)burnsideproject(dot)ai  

https://burnsideproject.ai  

https://github.com/orgs/burnside-project  

Capture, transform, and learn from PostgreSQL

Email Disclaimer:

This email and any attachments may contain privileged and confidential information intended solely for the use of the individual or entity to whom it is addressed. If you are not the intended recipient, you are hereby notified that any dissemination, distribution, copying, or use of this email or its contents is strictly prohibited. If you have received this email in error, please notify the sender immediately by replying to this message and delete it from your system.

Please note that any views or opinions expressed in this email are solely those of the author and do not necessarily represent those of Burnside Project LLC. Although we have taken precautions to ensure this email is free of viruses or other malicious software, we cannot guarantee the security or integrity of email communications. Recipients should verify attachments for possible threats.

Thank you.

Burnside Project

From: Jorge Daniel <elgaita(at)hotmail(dot)com>
To: "pgsql-admin(at)lists(dot)postgresql(dot)org"<pgsql-admin(at)lists(dot)postgresql(dot)org>
Date: Thu, 04 Jun 2026 14:13:45 -0700
Subject: Pg14 replication issue , recovery stucks in a random file without advancing while streaming from primary

Good day to everyone

We're asking the PG-comunity for some help if it is possible.

We have a primary with 2 secondaries: the primary went down and one of the secondaries was promoted. The orphaned Secondary reconnected to the new primary and is replicating ok.
We had to reconstruct a new secondary, we did it as we always do with the basic and dependable:

pg_basebackup -h uspgvento14r.us.local -U replicator -p 5432 -D $PGDATA -Fp -Xs -P -R --checkpoint=fast --create-slot --slot=us_vento_replica_slot_aux

It ran for 5hrs without problem. When it finished:
$ pg_ctl start

The recovery was running until a consistent point was reached, the database opened and started streaming the rest from the Primary.
After an hour or so, the recovery got stuck in a certain wal file. No more log entries about it (debug 1) and after some hours the stream connection got disconnected.
The secondary is hung on that wal file and not going forward with the rest of the wal file list.

We re-tried this several times, changing the storage (just in case), with a new box with the same original Ubuntu 22.04 instead of 24.04 (just in case), but the result was the same.
Even though we have the 22.04 and 24.04 in parallel, we saw both replica engines freeze on the same file (everytime we re-created the stuck-wal-file changed, clearly).
We're out of ideas of what's happening.
Could you please shed some light here?


Primary: uspgvento14r

Version
server_version | 14.11 (Ubuntu 14.11-0ubuntu0.22.04.1)

pg_stat_replication:

pid | 4177197
usesysid | 16474
usename | replicator
application_name | 14_replica
client_addr | 192.168.11.33
client_hostname |
client_port | 53534
backend_start | 2026-06-04 01:25:18.100137-07
backend_xmin |
state | streaming
sent_lsn | 8EBA/CDA68930
write_lsn | 8EBA/CDA68930
flush_lsn | 8EBA/CDA68930
replay_lsn | 8EB7/968FEBE8
write_lag | 00:00:00.001543
flush_lag | 00:00:00.00225
replay_lag | 01:00:22.530041
sync_priority | 0
sync_state | async
reply_time | 2026-06-04 03:33:28.037703-07

Log Primary

2026-06-04 01:25:18 PDT [unknown] [unknown] 192.168.12.34 [unknown] [4177197] LOG: connection received: host=192.168.12.34 port=53534
2026-06-04 01:25:18 PDT [2858] LOG: background worker "logical replication worker" (PID 4177182) exited with exit code 1
2026-06-04 01:25:18 PDT [unknown] replicator 192.168.12.34 [unknown] [4177197] LOG: connection authenticated: identity="replicator" method=md5 (/etc/postgresql/14/main/pg_hba.conf:95)
2026-06-04 01:25:18 PDT [unknown] replicator 192.168.12.34 [unknown] [4177197] LOG: replication connection authorized: user=replicator application_name=14_uspgvento14Rb SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256)
--
2026-06-04 04:02:26 PDT [unknown] replicator 192.168.12.34 14_uspgvento14Rb [4177197] LOG: disconnection: session time: 2:37:08.575 user=replicator database= host=192.168.12.34 port=53534




Secondary : 14_uspgvento14Rb 192.168.12.34

Version:
server_version | 14.11 (Ubuntu 14.11-0ubuntu0.24.04.1)

Ubuntu 14.23-1.pgdg24.04+1


.....
2026-06-04 08:25:14 UTC [197740] DEBUG: got WAL segment from archive
2026-06-04 08:25:14 UTC [197740] LOG: restored log file "0000000200008EA40000002D" from archive
2026-06-04 08:25:14 UTC [197740] DEBUG: got WAL segment from archive
2026-06-04 08:25:14 UTC [197740] LOG: restored log file "0000000200008EA40000002E" from archive
2026-06-04 08:25:14 UTC [197740] DEBUG: got WAL segment from archive
2026-06-04 08:25:14 UTC [197740] LOG: restored log file "0000000200008EA40000002F" from archive
2026-06-04 08:25:14 UTC [197740] DEBUG: got WAL segment from archive
2026-06-04 08:25:14 UTC [197740] LOG: restored log file "0000000200008EA400000030" from archive
2026-06-04 08:25:14 UTC [197740] DEBUG: got WAL segment from archive
026-06-04 08:25:15 UTC [197747] DEBUG: checkpoint sync: number=4 file=base/6176124/2840_fsm time=0.011 ms
2026-06-04 08:25:15 UTC [197747] DEBUG: checkpoint sync: number=5 file=base/6176124/6599312.33 time=1.493 ms
......
2026-06-04 08:25:18 UTC [197740] LOG: restored log file "0000000200008EA400000044" from archive
2026-06-04 08:25:18 UTC [197740] DEBUG: got WAL segment from archive
2026-06-04 08:25:18 UTC [197740] DEBUG: end of backup reached
2026-06-04 08:25:18 UTC [197740] CONTEXT: WAL redo at 8EA4/44680C08 for XLOG/BACKUP_END: 8E91/8B065578
2026-06-04 08:25:18 UTC [197740] LOG: consistent recovery state reached at 8EA4/44680C30
2026-06-04 08:25:18 UTC [197738] LOG: database system is ready to accept read-only connections
cp: cannot stat '/pg_data/pg14_wal_archive/0000000200008EA400000045': No such file or directory
2026-06-04 08:25:18 UTC [207957] LOG: started streaming WAL from primary at 8EA4/45000000 on timeline 2
2026-06-04 08:25:39 UTC [unknown] [unknown] [local] [unknown] [208039] LOG: connection received: host=[local]
2026-06-04 08:25:39 UTC postgres postgres [local] [unknown] [208039] LOG: connection authorized: user=postgres database=postgres application_name=psql
2026-06-04 08:26:15 UTC [197747] LOG: restartpoint starting: time
2026-06-04 08:26:15 UTC [197747] DEBUG: performing replication slot checkpoint
......
22026-06-04 08:30:15 UTC [197747] DEBUG: checkpoint sync: number=4 file=base/6176124/2840_fsm time=0.006 ms
2026-06-04 08:30:15 UTC [197747] DEBUG: checkpoint sync: number=5 file=base/6176124/6599312.33 time=1.572 ms
2026-06-04 08:30:15 UTC [197747] DEBUG: checkpoint sync: number=6 file=base/6176124/6602460.56 time=1.791 ms
.....
2026-06-04 09:35:15 UTC [197747] DEBUG: checkpoint sync: number=782 file=base/6176124/6601067.15 time=0.003 ms
2026-06-04 09:35:15 UTC [197747] DEBUG: checkpoint sync: number=783 file=base/6176124/6374431.10 time=0.006 ms
2026-06-04 09:35:15 UTC [197747] LOG: restartpoint complete: wrote 99500 buffers (19.0%); 0 WAL file(s) added, 0 removed, 25 recycled; write=239.934 s, sync=0.039 s, total=239.989 s; sync files=783, longest=0.001 s, average=0.001 s; distance=250897 kB, estimate=13117629 kB
2026-06-04 09:35:15 UTC [197747] LOG: recovery restart point at 8EB7/7550F228
2026-06-04 09:35:15 UTC [197747] DETAIL: Last completed transaction was at log time 2026-06-04 02:33:05.505911-07.
^(at)^@^(at)^@^(at)^@^(at)^@——> Forever<———    





postgres=# select pg_last_wal_replay_lsn();
pg_last_wal_replay_lsn
------------------------
8EB7/968FEBE8


postgres(at)uspgvento14Rb:/pg_data/data14/pg_wal$ ps -ef |grep wal
postgres 207957 197738 6 08:25 ? 00:04:52 postgres: 14_uspgvento14Rb: walreceiver streaming 8EB7/A39875E8
postgres 211309 163479 0 09:36 pts/12 00:00:00 grep wal
postgres(at)uspgvento14Rb:/pg_data/data14/pg_wal$ ps -ef |grep reco
postgres 197740 197738 16 08:11 ? 00:14:26 postgres: 14_uspgvento14Rb: startup recovering 0000000200008EB700000096
postgres 211311 163479 0 09:36 pts/12 00:00:00 grep reco



Content of the wal :

postgres(at)uspgvento14b:/pg_data/data14/pg_wal$ pg_waldump 0000000200008EB700000096 |grep 8FEBE8
rmgr: MultiXact len (rec/tot): 54/ 54, tx: 2681967401, lsn: 8EB7/968FEBE8, prev 8EB7/968FEBA8, desc: CREATE_ID 2439167 offset 5217620 nmembers 2: 2681967400 (keysh) 2681967401 (keysh)
rmgr: Heap len (rec/tot): 54/ 54, tx: 2681967401, lsn: 8EB7/968FEC20, prev 8EB7/968FEBE8, desc: LOCK off 35: xid 2439167: flags 0x00 IS_MULTI LOCK_ONLY KEYSHR_LOCK , blkref #0: rel 1663/6176124/6474188 blk 29758258


There are still plenty of files to process:
postgres(at)uspgvento14Rb:/pg_data/data14/pg_wal$ ls -ltr |grep -A5 -B5 0000000200008EB700000096
-rw------- 1 postgres postgres 16777216 Jun 4 09:32 0000000200008EB700000091
-rw------- 1 postgres postgres 16777216 Jun 4 09:32 0000000200008EB700000092
-rw------- 1 postgres postgres 16777216 Jun 4 09:32 0000000200008EB700000093
-rw------- 1 postgres postgres 16777216 Jun 4 09:32 0000000200008EB700000094
-rw------- 1 postgres postgres 16777216 Jun 4 09:32 0000000200008EB700000095
-rw------- 1 postgres postgres 16777216 Jun 4 09:33 0000000200008EB700000096
-rw------- 1 postgres postgres 16777216 Jun 4 09:33 0000000200008EB700000097
-rw------- 1 postgres postgres 16777216 Jun 4 09:33 0000000200008EB700000098
-rw------- 1 postgres postgres 16777216 Jun 4 09:33 0000000200008EB700000099
-rw------- 1 postgres postgres 16777216 Jun 4 09:34 0000000200008EB70000009A
-rw------- 1 postgres postgres 16777216 Jun 4 09:34 0000000200008EB70000009B


Postgresql.conf Secondary (primary is the same, but with more memory)

All non-default values:

shared_buffers = 4GB # min 128kB
temp_buffers = 512MB # min 800kB
work_mem = 512MB # min 64kB
maintenance_work_mem = 4GB # min 1MB
autovacuum_work_mem = 512MB # min 1MB, or -1 to use maintenance_work_mem
max_stack_depth = 7MB
bgwriter_lru_maxpages = 1000 # max buffers written/round, 0 disables

wal_level = logical # minimal, replica, or logical
wal_log_hints = on # also do full page writes of non-critical updates

checkpoint_completion_target = 0.8 # checkpoint target duration, 0.0 - 1.0
checkpoint_warning = 600s # 0 disables
max_wal_size = 500GB
min_wal_size = 50GB

max_wal_senders = 10 # max number of walsender processes
max_replication_slots = 30 # max number of replication slots
wal_keep_size = 150GB # in megabytes; 0 disables 30720=30GB 1.5 day
max_slot_wal_keep_size = 500GB # in megabytes; -1 disables 307200=300GB 4 days
wal_sender_timeout = 600s # in milliseconds; 0 disables

max_standby_archive_delay = 48h # max delay before canceling queries
max_standby_streaming_delay = 48h # max delay before canceling queries
hot_standby_feedback = on # send info from standby to prevent
wal_receiver_timeout = 600s # time that receiver waits for

In response to

Browse pgsql-admin by date

  From Date Subject
Next Message Raj 2026-06-05 03:47:53 Re: Low TPS
Previous Message Jorge Daniel 2026-06-04 21:13:45 Pg14 replication issue , recovery stucks in a random file without advancing while streaming from primary