Re: Exit walsender before confirming remote flush in logical replication

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Andrey Silitskiy <a(dot)silitskiy(at)postgrespro(dot)ru>
Cc: Greg Sabino Mullane <htamfids(at)gmail(dot)com>, Japin Li <japinli(at)hotmail(dot)com>, Ronan Dunklau <ronan(at)dunklau(dot)fr>, Vitaly Davydov <v(dot)davydov(at)postgrespro(dot)ru>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, "Takamichi Osumi (Fujitsu)" <osumi(dot)takamichi(at)fujitsu(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "sawada(dot)mshk(at)gmail(dot)com" <sawada(dot)mshk(at)gmail(dot)com>, "michael(at)paquier(dot)xyz" <michael(at)paquier(dot)xyz>, "peter(dot)eisentraut(at)enterprisedb(dot)com" <peter(dot)eisentraut(at)enterprisedb(dot)com>, "dilipbalaut(at)gmail(dot)com" <dilipbalaut(at)gmail(dot)com>, "andres(at)anarazel(dot)de" <andres(at)anarazel(dot)de>, "amit(dot)kapila16(at)gmail(dot)com" <amit(dot)kapila16(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Subject: Re: Exit walsender before confirming remote flush in logical replication
Date: 2026-03-25 12:39:00
Message-ID: CAHGQGwHZnhodg7+xQZQOoRhX+emPUoPPgKZm7fOYOH3DpxxC8g@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 16, 2026 at 5:52 AM Andrey Silitskiy
<a(dot)silitskiy(at)postgrespro(dot)ru> wrote:
>
> On Fri, 13 Mar 2026 Greg Sabino Mullane <htamfids(at)gmail(dot)com> wrote:
> > +1. I don't think we need to measure any times, but we do need to
> > exercise that whole part of the code, ...
>
> Added a case with a positive but small wal_sender_shutdown_timeout
> to the test.

Thanks for updating the patch!

I tested wal_sender_shutdown_timeout under several configurations and
encountered a case where the primary's shutdown hung, even with the patch
applied and wal_sender_shutdown_timeout = 1. I'm not sure yet whether this is
a bug in the patch or an issue with my test setup, but I'd like to share
the reproduction steps for reference anyway.

--------------------------------------------------------------------------------
#1. Set up primary, standby (with slot sync), and subscriber

initdb -D data --encoding=UTF8 --locale=C
cat <<EOF >> data/postgresql.conf
wal_level = logical
synchronized_standby_slots = 'physical_slot'
wal_sender_timeout = 1h
wal_sender_shutdown_timeout = 1
log_line_prefix = '%t %p [%b] data '
EOF
pg_ctl -D data start
pg_receivewal --create-slot -S physical_slot

pg_basebackup -D sby1 -c fast -R -S physical_slot -d "dbname=postgres"
cat <<EOF >> sby1/postgresql.conf
port = 5433
sync_replication_slots = on
hot_standby_feedback = on
log_line_prefix = '%t %p [%b] sby1 '
EOF
pg_ctl -D sby1 start

initdb -D sub1 --encoding=UTF8 --locale=C
cat <<EOF >> sub1/postgresql.conf
port = 5434
wal_level = logical
log_line_prefix = '%t %p [%b] sub1 '
EOF
pg_ctl -D sub1 start

#2. Create table, publication, and subscription

psql -p 5432 <<EOF
create table t (i int primary key, j int);
create publication mypub for table t;
EOF

psql -p 5434 <<EOF
create table t (i int primary key, j int);
create subscription mysub connection 'port=5432' publication mypub
with (failover = 'on');
EOF

#3. Lock the table on the subscriber (run this in a separate terminal; it holds the lock while pg_sleep runs)

psql -p 5434 <<EOF
begin;
lock table t in access exclusive mode;
select pg_sleep(1000);
EOF

#4. Insert a row and confirm the apply worker is waiting for the lock

psql -p 5432 -c "insert into t values(1, 0)"

psql -p 5434 -c "select * from pg_stat_activity where backend_type
like '%apply worker%'"

#5. Block the walreceiver on the standby with SIGSTOP

kill -SIGSTOP $(psql -p 5433 -X -Atc "SELECT pid FROM pg_stat_wal_receiver")

#6. Insert another row and shut down the primary

psql -p 5432 -c "insert into t values(2, 0)"

pg_ctl -D data stop
--------------------------------------------------------------------------------
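
Step #5 relies on SIGSTOP freezing the walreceiver without terminating it: the process stays alive and keeps its connection to the primary open, but it stops reading WAL and stops sending feedback until it receives SIGCONT. A minimal sketch of that behaviour, using a plain `sleep` process as a stand-in for the walreceiver (an assumption made so that no PostgreSQL instance is needed):

```shell
#!/bin/sh
# Stand-in for the walreceiver in step #5: a plain sleep process.
# Nothing here talks to PostgreSQL; it only shows what SIGSTOP/SIGCONT do.

# Print the one-letter scheduler state of a process (T = stopped).
proc_state() {
    if [ -r "/proc/$1/stat" ]; then
        # Linux: field 3 of /proc/<pid>/stat is the state letter
        # (field 2 is the command name in parentheses, without spaces here).
        cut -d ' ' -f 3 "/proc/$1/stat"
    else
        # Other systems: ask ps, dropping padding and modifier flags.
        ps -o stat= -p "$1" | tr -d ' ' | cut -c 1
    fi
}

sleep 300 &
pid=$!

kill -s STOP "$pid"
sleep 1                      # give the signal time to take effect
stopped_state=$(proc_state "$pid")
echo "after SIGSTOP: $stopped_state"

kill -s CONT "$pid"
sleep 1
resumed_state=$(proc_state "$pid")
echo "after SIGCONT: $resumed_state"

kill "$pid"                  # clean up the stand-in process
wait "$pid" 2>/dev/null || true
```

The stopped process reports state T and is still present; after SIGCONT it goes back to sleeping, which is why the walreceiver in step #5 can later be resumed without restarting the standby.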

In my tests, the shutdown in step #6 got stuck.
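
A sketch of two hypothetical helpers (not part of the patch or of the steps above) that may be handy while the shutdown is hanging. They assume the data directory `data`, standby port 5433, and a POSIX `ps`; the function names are made up for illustration:

```shell
#!/bin/sh
# Hypothetical helpers for inspecting the hang in step #6; not part of the
# patch. Assumes the "data" directory and standby port 5433 used above.

# List the primary's remaining child processes (e.g. a walsender that has
# not exited yet). The first line of postmaster.pid is the postmaster's PID.
show_primary_children() {
    datadir=${1:-data}
    pm_pid=$(head -n 1 "$datadir/postmaster.pid")
    # POSIX ps: pid, parent pid, command line; keep direct children only.
    ps -A -o pid= -o ppid= -o args= | awk -v pm="$pm_pid" '$2 == pm'
}

# Undo step #5: the stopped walreceiver should still be listed in
# pg_stat_wal_receiver, so look up its PID there and send SIGCONT.
resume_walreceiver() {
    port=${1:-5433}
    wr_pid=$(psql -p "$port" -X -Atc "SELECT pid FROM pg_stat_wal_receiver")
    if [ -n "$wr_pid" ]; then
        kill -s CONT "$wr_pid"
    fi
}
```

For example, run `show_primary_children` while `pg_ctl -D data stop` is waiting, then `resume_walreceiver` to return the standby to normal operation.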

Any thoughts on whether this indicates a problem in the patch or
something off in my setup?

Regards,

--
Fujii Masao
