| From: | Alexander Lakhin <exclusion(at)gmail(dot)com> |
|---|---|
| To: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | 035_standby_logical_decoding might fail due to FATAL message lost inside libpq |
| Date: | 2026-05-17 15:00:00 |
| Message-ID: | a50ff89b-fb2c-4912-8f89-50dd0bd0916b@gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hello hackers,
A recent buildfarm failure [1]:
38/273 recovery - postgresql:recovery/035_standby_logical_decoding ERROR 74.36s exit status 1
[14:34:25.522](0.001s) ok 65 - pg_recvlogical exited non-zero
[14:34:25.522](0.000s) not ok 66 - slot has been invalidated
[14:34:25.523](0.000s) # Failed test 'slot has been invalidated'
# at /home/bf/bf-build/francolin/REL_16_STABLE/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 118.
[14:34:25.523](0.001s) # 'pg_recvlogical: error: could not send replication command "START_REPLICATION
SLOT "drop_db_activeslot" LOGICAL 0/0 ("include-xids" '0', "skip-empty-xacts" '1')": server closed the connection
unexpectedly
# This probably means the server terminated abnormally
# before or while processing the request.
# pg_recvlogical: error: disconnected
# '
# doesn't match '(?^:conflict with recovery)'
[14:34:25.675](0.152s) ok 67 - otherslot on standby not dropped
duplicated one that happened the last year [2] and had no explanation so
far. This time I've managed to reproduce it with the check_slots_dropped()
repeated in the loop (the patch is attached).
With extra logging added, I could see:
[08:47:52.385](0.000s) not ok 114 - slot has been invalidated
[08:47:52.385](0.000s) # Failed test 'slot has been invalidated'
# at /home/user/postgresql/src/test/recovery_9/t/035_standby_logical_decoding.pl line 120.
[08:47:52.385](0.000s) # '!!!pqsecure_raw_read| len=16384 n=554 errno=0
buf=760000002b00030002000000015f70715f2e746573745f70726f746f636f6c5f6e65676f74696174696f6e00520000000800000000530000001b496e74657276616c5374796c6500706f7374677265730053000000207365617263685f7061746800222475736572222c207075626c696300530000001469735f737570657275736572006f6e0053000000237374616e646172645f636f6e666f726d696e675f737472696e6773006f6e00530000001f73657373696f6e5f617574686f72697a6174696f6e0075736572005300000019636c69656e745f656e636f64696e67005554463800530000001b7365727665725f76657273696f6e003139646576656c
# !!!pqsecure_raw_read| len=16384 n=85 errno=0
buf=540000002300017365745f636f6e6669670000000000000000000019ffffffffffff0000440000000a000100000000430000000d53454c45435420310053000000117365617263685f7061746800005a0000000549
# !!!pqsecure_raw_read| len=16384 n=76 errno=0
buf=540000002c0001646174615f6469726563746f72795f6d6f64650000000000000000000019ffffffffffff0000440000000e00010000000430373030430000000953484f57005a0000000549
# !!!pqsecure_raw_read| len=16384 n=198 errno=0
buf=540000006f000473797374656d69640000000000000000000019ffffffffffff000074696d656c696e6500000000000000000000140008ffffffff0000786c6f67706f730000000000000000000019ffffffffffff000064626e616d650000000000000000000019ffffffffffff0000440000003a0004000000133736343038333834313938393731313334343000000001310000000a302f30373138433739380000000674657374646243000000144944454e544946595f53595354454d005a0000000549
# !!!pqsecure_raw_read| len=16384 n=181 errno=0
buf=45000000b453464154414c0056464154414c00433537503034004d7465726d696e6174696e6720636f6e6e656374696f6e2064756520746f20636f6e666c6963742077697468207265636f766572790044557365722077617320636f6e6e656374656420746f20612064617461626173652074686174206d7573742062652064726f707065642e0046706f7374677265732e63004c3332313300527265706f72745f7265636f766572795f636f6e666c6963740000
### FATAL...57P04...terminating connection due to conflict with recovery
# !!!pqsecure_raw_read| len=16384 n=0 errno=0 buf=
# !!!pqsecure_raw_read| len=16384 n=0 errno=0 buf=
# pg_recvlogical: error: could not send replication command "START_REPLICATION SLOT "drop_db_activeslot" LOGICAL
0/00000000 ("include-xids" '0', "skip-empty-xacts" '1')": server closed the connection unexpectedly
# This probably means the server terminated abnormally
# before or while processing the request.
# pg_recvlogical: error: disconnected
# '
# doesn't match '(?^:conflict with recovery)'
That is, the FATAL message was received from the server, but libpq lost it inside.
A successful run shows this:
!!!check_pg_recvlogical_stderr:
!!!pqsecure_raw_read| len=16384 n=554 errno=0
buf=760000002b00030002000000015f70715f2e746573745f70726f746f636f6c5f6e65676f74696174696f6e00520000000800000000530000001b496e74657276616c5374796c6500706f7374677265730053000000207365617263685f7061746800222475736572222c207075626c696300530000001469735f737570657275736572006f6e0053000000237374616e646172645f636f6e666f726d696e675f737472696e6773006f6e00530000001f73657373696f6e5f617574686f72697a6174696f6e0075736572005300000019636c69656e745f656e636f64696e67005554463800530000001b7365727665725f76657273696f6e003139646576656c
!!!pqsecure_raw_read| len=16384 n=85 errno=0
buf=540000002300017365745f636f6e6669670000000000000000000019ffffffffffff0000440000000a000100000000430000000d53454c45435420310053000000117365617263685f7061746800005a0000000549
!!!pqsecure_raw_read| len=16384 n=76 errno=0
buf=540000002c0001646174615f6469726563746f72795f6d6f64650000000000000000000019ffffffffffff0000440000000e00010000000430373030430000000953484f57005a0000000549
!!!pqsecure_raw_read| len=16384 n=198 errno=0
buf=540000006f000473797374656d69640000000000000000000019ffffffffffff000074696d656c696e6500000000000000000000140008ffffffff0000786c6f67706f730000000000000000000019ffffffffffff000064626e616d650000000000000000000019ffffffffffff0000440000003a0004000000133736343038333834313938393731313334343000000001310000000a302f30364432393539380000000674657374646243000000144944454e544946595f53595354454d005a0000000549
!!!pqsecure_raw_read| len=16384 n=8 errno=0 buf=5700000007000000
!!!pqsecure_raw_read| len=16384 n=46 errno=0
buf=64000000166b0000000006d295980002f5019f6590590064000000166b0000000006d295980002f5019f6590c100
!!!pqsecure_raw_read| len=16384 n=46 errno=0
buf=64000000166b0000000006d295980002f5019f66df5b0064000000166b0000000006d295d00002f5019f66e01d00
!!!pqsecure_raw_read| len=16384 n=46 errno=0
buf=64000000166b0000000006d295d00002f5019f6719390064000000166b0000000006d296a00002f5019f67197c00
!!!pqsecure_raw_read| len=16384 n=23 errno=0 buf=64000000166b0000000006d296a00002f5019f671b3300
!!!pqsecure_raw_read| len=16384 n=23 errno=0 buf=64000000166b0000000006d297800002f5019f671bc900
!!!pqsecure_raw_read| len=16384 n=46 errno=0
buf=64000000166b0000000006d297800002f5019f671c970064000000166b0000000006d298000002f5019f671cbf00
!!!pqsecure_raw_read| len=16384 n=181 errno=0
buf=45000000b453464154414c0056464154414c00433537503034004d7465726d696e6174696e6720636f6e6e656374696f6e2064756520746f20636f6e666c6963742077697468207265636f766572790044557365722077617320636f6e6e656374656420746f20612064617461626173652074686174206d7573742062652064726f707065642e0046706f7374677265732e63004c3332313300527265706f72745f7265636f766572795f636f6e666c6963740000
### FATAL...57P04...terminating connection due to conflict with recovery
pg_recvlogical: error: unexpected termination of replication stream: FATAL: terminating connection due to conflict with
recovery
DETAIL: User was connected to a database that must be dropped.
pg_recvlogical: error: disconnected
Thus, FATAL is received by pg_recvlogical as expected, if it follows
CopyBothResponse in the stream.
Having this knowledge, I could reproduce such behavior simply with:
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1501,6 +1501,7 @@ StartLogicalReplication(StartReplicationCmd *cmd)
ReplicationSlotAcquire(cmd->slotname, true, true);
+pg_usleep(1000000);
/*
* Force a disconnect, so that the decoding code doesn't need to care
* about an eventual switch from running in recovery, to running in a
# +++ tap check in src/test/recovery +++
t/035_standby_logical_decoding.pl .. 69/?
# Failed test 'slot has been invalidated'
# at t/035_standby_logical_decoding.pl line 119.
# 'pg_recvlogical: error: could not send replication command "START_REPLICATION SLOT
"drop_db_activeslot" LOGICAL 0/00000000 ("include-xids" '0', "skip-empty-xacts" '1')": server closed the connection
unexpectedly
# This probably means the server terminated abnormally
# before or while processing the request.
# pg_recvlogical: error: disconnected
# '
# doesn't match '(?^:conflict with recovery)'
t/035_standby_logical_decoding.pl .. 88/? # Looks like you failed 1 test of 89.
Reproduced on REL_16_STABLE (where this test was landed) .. master.
[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=francolin&dt=2026-05-11%2012%3A25%3A04
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2025-08-04%2005%3A45%3A46
Best regards,
Alexander
| Attachment | Content-Type | Size |
|---|---|---|
| debug-035_standby_logical_decoding.patch | text/x-patch | 3.6 KB |
| From | Date | Subject | |
|---|---|---|---|
| Previous Message | solai v | 2026-05-17 14:50:50 | Re: Bug: mdunlinkfiletag unlinks mainfork seg.0 instead of indicated fork+segment |