Re: Exit walsender before confirming remote flush in logical replication

From: Vitaly Davydov <v(dot)davydov(at)postgrespro(dot)ru>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Andrey Silitskiy <a(dot)silitskiy(at)postgrespro(dot)ru>
Cc: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, "Takamichi Osumi (Fujitsu)" <osumi(dot)takamichi(at)fujitsu(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "sawada(dot)mshk(at)gmail(dot)com" <sawada(dot)mshk(at)gmail(dot)com>, "michael(at)paquier(dot)xyz" <michael(at)paquier(dot)xyz>, "peter(dot)eisentraut(at)enterprisedb(dot)com" <peter(dot)eisentraut(at)enterprisedb(dot)com>, "dilipbalaut(at)gmail(dot)com" <dilipbalaut(at)gmail(dot)com>, "andres(at)anarazel(dot)de" <andres(at)anarazel(dot)de>, "amit(dot)kapila16(at)gmail(dot)com" <amit(dot)kapila16(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Greg Sabino Mullane <htamfids(at)gmail(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Subject: Re: Exit walsender before confirming remote flush in logical replication
Date: 2026-01-20 17:03:55
Message-ID: e25567b4-9893-48bf-ac17-0e884f1acef9@postgrespro.ru
Lists: pgsql-hackers

Dear Hackers,

I think I have reproduced the test failure. The test fails because the
walsender is stuck waiting in WalSndDoneImmediate -> ereport, with the stack
shown below. It appears to be trying to send the message to the replica and
flush it, but the replica is hung, so the flush never completes.

#0 0x00007a4b37f2a037 in epoll_wait
#1 0x000056855317a2e8 in WaitEventSetWaitBlock
#2 WaitEventSetWait
#3 0x0000568552feea8e in secure_write
#4 0x0000568552ff5666 in internal_flush_buffer
#5 0x0000568552ff5966 in internal_flush
#6 socket_flush ()
#7 socket_flush ()
#8 0x00005685532ff1b3 in send_message_to_frontend (edata=<optimized out>)
#9 EmitErrorReport ()
#10 0x00005685532ff6dd in errfinish
#11 0x000056855312cc9c in WalSndDoneImmediate () at walsender.c:3625

I propose removing the ereport call from WalSndDoneImmediate.
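
For illustration only, the blocking behaviour can be reproduced outside
PostgreSQL. The standalone sketch below is not walsender code, and
fill_until_would_block is a hypothetical helper name. It writes into a Unix
socket pair whose peer never reads. With the socket in non-blocking mode,
write() eventually fails with EAGAIN once the kernel buffer is full; on a
blocking socket the same write would simply sleep, which I assume is
comparable to what secure_write is doing in the stack above.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only (not PostgreSQL code).
 *
 * Writes to one end of a socket pair whose other end is never read,
 * emulating a hung receiver. Returns the number of bytes the kernel
 * accepted before the send buffer filled up, or -1 on an unexpected error.
 * *would_block is set to 1 if the last write failed with EAGAIN or
 * EWOULDBLOCK, i.e. the point where a blocking write would start to wait.
 */
static long
fill_until_would_block(int *would_block)
{
    int     sv[2];
    int     flags;
    char    chunk[4096];
    long    total = 0;

    assert(would_block != NULL);
    *would_block = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;

    /* Switch the sending end to non-blocking mode so the demo cannot hang. */
    flags = fcntl(sv[0], F_GETFL, 0);
    if (flags < 0 || fcntl(sv[0], F_SETFL, flags | O_NONBLOCK) != 0)
    {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    memset(chunk, 'x', sizeof(chunk));

    for (;;)
    {
        ssize_t n = write(sv[0], chunk, sizeof(chunk));

        if (n > 0)
        {
            total += n;         /* the kernel buffered this much */
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;           /* interrupted, just retry */
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        {
            *would_block = 1;   /* buffer full: the peer is not draining it */
            break;
        }
        total = -1;             /* any other outcome is unexpected here */
        break;
    }

    close(sv[0]);
    close(sv[1]);
    return total;
}
```

Calling fill_until_would_block(&flag) should return a positive byte count
and set flag to 1. The exact count depends on the platform's socket buffer
size, so I have not quoted a number.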

With best regards,
Vitaly

On 1/19/26 15:41, Fujii Masao wrote:
> On Sun, Jan 18, 2026 at 1:20 AM Andrey Silitskiy
> <a(dot)silitskiy(at)postgrespro(dot)ru> wrote:
>>
>> On Jan 9, 2026 at 10:04 AM Fujii Masao
>> <masao(dot)fujii(at)gmail(dot)com> wrote:
>>> Why do we need to send a "done" message to the receiver here?
>>> Since delivery isn't guaranteed in immediate mode, it seems of limited
>>> value.
>>
>> It seems to me that it is better to send a message in cases where it is
>> possible, so as not to raise errors on the subscriber during a clean shutdown.
>> And when this is not possible, exit the process without waiting.
>>
>>> For the immediate mode, would it make sense to log that the walsender is
>>> terminating in immediate mode and that WAL replication may be incomplete,
>>> so users can more easily understand what happened?
>>
>> Added to the latest patch.
>
> Thanks for updating the patch!
>
> cfbot is reporting a test failure. Could you please look into it and
> fix the issue?
> https://cirrus-ci.com/github/postgresql-cfbot/postgresql/cf%2F6234
>
> Regards,
>
