| From: | Alexander Lakhin <exclusion(at)gmail(dot)com> |
|---|---|
| To: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
| Cc: | Andres Freund <andres(at)anarazel(dot)de>, Andrey Silitskiy <a(dot)silitskiy(at)postgrespro(dot)ru>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Greg Sabino Mullane <htamfids(at)gmail(dot)com>, Japin Li <japinli(at)hotmail(dot)com>, Ronan Dunklau <ronan(at)dunklau(dot)fr>, Vitaly Davydov <v(dot)davydov(at)postgrespro(dot)ru>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, "Takamichi Osumi (Fujitsu)" <osumi(dot)takamichi(at)fujitsu(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "sawada(dot)mshk(at)gmail(dot)com" <sawada(dot)mshk(at)gmail(dot)com>, "michael(at)paquier(dot)xyz" <michael(at)paquier(dot)xyz>, "peter(dot)eisentraut(at)enterprisedb(dot)com" <peter(dot)eisentraut(at)enterprisedb(dot)com>, "dilipbalaut(at)gmail(dot)com" <dilipbalaut(at)gmail(dot)com>, "amit(dot)kapila16(at)gmail(dot)com" <amit(dot)kapila16(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com> |
| Subject: | Re: Exit walsender before confirming remote flush in logical replication |
| Date: | 2026-05-31 11:00:01 |
| Message-ID: | 923f6d11-f43b-48c8-a8ed-3f7da6493374@gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hello Fujii-san,
06.04.2026 18:32, Andres Freund wrote:
> Hi,
>
> On 2026-04-06 11:40:24 +0900, Fujii Masao wrote:
>> I've pushed the patch. Thanks!
> Failed on CI just now:
>
> https://cirrus-ci.com/task/6745359004729344?logs=test_world#L410
> https://api.cirrus-ci.com/v1/artifact/task/6745359004729344/testrun/build/testrun/subscription/038_walsnd_shutdown_timeout/log/regress_log_038_walsnd_shutdown_timeout
A recent failure occurred on the buildfarm [1] indicates another stability
issue, this time with the test itself, I guess:
[21:12:03.301](0.068s) ok 3 - have walreceiver pid 2275585
### Stopping node "publisher" using mode fast
# Running: pg_ctl --pgdata
/home/bf/bf-build/tamandua/HEAD/pgsql.build/testrun/subscription/038_walsnd_shutdown_timeout/data/t_038_walsnd_shutdown_timeout_publisher_data/pgdata
--mode fast stop
waiting for server to shut down.... done
server stopped
# No postmaster PID for node "publisher"
[21:12:03.752](0.452s) not ok 4 - walsender exits due to wal_sender_shutdown_timeout even when both physical and logical
replication are stalled
[21:12:03.753](0.001s) # Failed test 'walsender exits due to wal_sender_shutdown_timeout even when both physical and
logical replication are stalled'
# at /home/bf/bf-build/tamandua/HEAD/pgsql/src/test/subscription/t/038_walsnd_shutdown_timeout.pl line 189.
### Stopping node "subscriber" using mode fast
038_walsnd_shutdown_timeout_subscriber.log doesn't really contain the
expected warning:
2026-05-29 21:11:58.777 CEST [2273817][logical replication apply worker][124/2:0] LOG: logical replication apply worker
for subscription "test_sub" has started
2026-05-29 21:12:03.232 CEST [2271870][client backend][4/6:0] LOG: statement: BEGIN;
2026-05-29 21:12:03.232 CEST [2271870][client backend][4/6:0] LOG: statement: LOCK TABLE test_tab IN EXCLUSIVE MODE;
2026-05-29 21:12:03.232 CEST [2271870][client backend][4/6:0] LOG: statement: ;
2026-05-29 21:12:03.268 CEST [2273817][logical replication apply worker][124/0:0] LOG: received message via
replication: WARNING: replication slot "test_slot" specified in parameter "synchronized_standby_slots" does not have
active_pid
DETAIL: Logical replication is waiting on the standby associated with replication slot "test_slot".
HINT: Start the standby associated with the replication slot "test_slot", or amend parameter
"synchronized_standby_slots".
2026-05-29 21:12:03.433 CEST [2273817][logical replication apply worker][124/0:0] ERROR: could not receive data from
WAL stream: ERROR: replication slot "test_slot" specified in parameter "synchronized_standby_slots" does not have
active_pid
DETAIL: Logical replication is waiting on the standby associated with replication slot "test_slot".
HINT: Start the standby associated with the replication slot "test_slot", or amend parameter
"synchronized_standby_slots".
2026-05-29 21:12:03.437 CEST [2271017][postmaster][:0] LOG: background worker "logical replication apply worker" (PID
2273817) exited with exit code 1
2026-05-29 21:12:03.439 CEST [2275785][logical replication apply worker][125/2:0] LOG: logical replication apply worker
for subscription "test_sub" has started
2026-05-29 21:12:03.441 CEST [2275785][logical replication apply worker][125/0:0] ERROR: apply worker for subscription
"test_sub" could not connect to the publisher: connection to server on socket "/tmp/3gDgOpAsXL/.s.PGSQL.27739" failed:
FATAL: the database system is shutting down
2026-05-29 21:12:03.445 CEST [2271017][postmaster][:0] LOG: background worker "logical replication apply worker" (PID
2275785) exited with exit code 1
2026-05-29 21:12:03.763 CEST [2271870][client backend][:0] LOG: disconnection: session time: 0:00:08.180 user=bf
database=postgres host=[local]
2026-05-29 21:12:03.767 CEST [2271017][postmaster][:0] LOG: received fast shutdown request
I think this can be explained by the fact that walrcv->ready_to_display
is set before WalReceiverMain's loop reached. I've reproduced this test
failure with:
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -280,6 +280,7 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
walrcv->sender_port = 0;
walrcv->ready_to_display = true;
SpinLockRelease(&walrcv->mutex);
+pg_usleep(500 * 1000);
/* Establish the connection to the primary for XLOG streaming */
appname = cluster_name[0] ? cluster_name : "walreceiver";
Could you please look if this can be improved?
[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2026-05-29%2019%3A03%3A21
Best regards,
Alexander
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Tender Wang | 2026-05-31 11:28:37 | Re: Eager aggregation, take 3 |
| Previous Message | 신성준 | 2026-05-31 10:42:41 | Re: Add wait events for server logging destination writes |