RE: 035_standby_logical_decoding might fail due to FATAL message lost inside libpq

From: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
To: 'Alexander Lakhin' <exclusion(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: RE: 035_standby_logical_decoding might fail due to FATAL message lost inside libpq
Date: 2026-06-24 14:12:15
Message-ID: OS9PR01MB12149B91145ABFD81D0A6D642F5ED2@OS9PR01MB12149.jpnprd01.prod.outlook.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Dear Alexander,

Thanks for sharing and sorry for late reply. I was busy for the company work.
I analyzed the test failure and found a candidate reason.

Per my understanding, this can happen when the FATAL error is reported while
PQexec() is still running in StreamLogicalLog(). This behavior seems to come from
the current PQexecFinish() semantics for backward compatibility, so I do not
think we should fix the libpq code. Instead, the test should be adjusted.

The attached patch shows my analysis. It adds a sleep in StartLogicalReplication()
based on your reproducer, debug logs in PQexecFinish(), and debug output in the
test function check_pg_recvlogical_stderr(). Please apply the patch and run the
test once to follow the behavior described below.

[1] shows the result when the unexpected output happens with my debug logs.
Note that the test passed because the debug log itself contained terminating
connection due to conflict with recovery. However, the following line shows that
pg_recvlogical actually reported server closed the connection unexpectedly, which
is the wrong output.

The whole output in [1] indicates that two PGRES_FATAL_ERROR results were
consumed by PQexec() / PQexecFinish(), but only the second one was returned to
the pg_recvlogical layer. This is the root cause. However, this seems to be
intentional behavior: the comment in PQexecFinish() says that only the last
result is returned for backward compatibility.

pg_usleep() is needed for the reproducer because the issue can happen only when
the FATAL error is reported during the PQexec() call for `START_REPLICATION ...`.
[2] shows the case without the sleep. In that case, the error was reported after
PQexec() had already returned PGRES_COPY_BOTH. Then PQgetCopyData(conn, &copybuf, 1)
detected the FATAL error, and the subsequent code reported it as an unexpected
termination of the replication stream.

To address this issue, I think the test should ensure that pg_recvlogical has
finished running PQexec() and entered the COPY BOTH receive loop. Let me
investigate that a bit more.

How do you feel the analysis?

[1]:
```
[22:21:11.808](0.000s) ok 78 - slot has been invalidated
PQexecFinish loop: errMsg=(null)
PQexecFinish return: errMsg=(null)
PQexecFinish loop: errMsg=(null)
PQexecFinish return: errMsg=(null)
PQexecFinish loop: errMsg=(null)
PQexecFinish return: errMsg=(null)
PQexecFinish loop: errMsg=FATAL: terminating connection due to conflict with recovery
DETAIL: User was connected to a database that must be dropped.

PQexecFinish loop: errMsg=server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

PQexecFinish return: errMsg=server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

pg_recvlogical: debug: after PQexec
pg_recvlogical: error: could not send replication command "START_REPLICATION SLOT "drop_db_activeslot" LOGICAL 0/00000000 ("include-xids" '0', "skip-empty-xacts" '1')": server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
```

[2]:
```
[22:43:45.508](0.000s) ok 78 - slot has been invalidated
PQexecFinish loop: errMsg=(null)
PQexecFinish return: errMsg=(null)
PQexecFinish loop: errMsg=(null)
PQexecFinish return: errMsg=(null)
PQexecFinish loop: errMsg=(null)
PQexecFinish return: errMsg=(null)
PQexecFinish loop: errMsg=(null)
PQexecFinish return: errMsg=(null)
pg_recvlogical: debug: after PQexec
pg_recvlogical: error: unexpected termination of replication stream: FATAL: terminating connection due to conflict with recovery
DETAIL: User was connected to a database that must be dropped.
```

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Attachment Content-Type Size
0001-Analyze-035_standby_logical_decoding-failure.patch application/octet-stream 2.6 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrei Lepikhov 2026-06-24 14:35:29 Re: RFC: Logging plan of the running query
Previous Message Fujii Masao 2026-06-24 14:02:04 Re: psql: add tab completion for subscription wal_receiver_timeout