Re: Exit walsender before confirming remote flush in logical replication

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "dilipbalaut(at)gmail(dot)com" <dilipbalaut(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exit walsender before confirming remote flush in logical replication
Date: 2023-01-17 09:11:00
Message-ID: CAA4eK1JT0AGBgNGN3TbsLXb+ybxG1D00LK+UNzrPUupSx3nK0g@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jan 16, 2023 at 4:39 PM Hayato Kuroda (Fujitsu)
<kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
>
> > In logical replication, apply precedes write and flush, so we have no
> > indication of whether a record is "replicated" to the standby other
> > than the apply LSN. On the other hand, logical replication has no
> > business with switchover, so that assurance is useless. Thus I think
> > we can (practically) ignore apply_lsn at shutdown. It seems subtly
> > irregular, though.
>
> Another consideration is that the condition (!pq_is_send_pending()) ensures that
> there are no pending messages, including other packets. Currently we force walsenders
> to clean up all messages before shutting down, even if the remaining message is only
> a keepalive. I cannot think of any problems caused by this, but I think we can keep
> the condition in the case of logical replication.
>

Let me try to summarize the discussion till now. The problem we are
trying to solve here is to allow a shutdown to complete when walsender
is not able to send the entire WAL. Currently, in such cases, the
shutdown fails. As per our current understanding, this can happen when
(a) walreceiver/walapply process is stuck (not able to receive more
WAL) due to locks or some other reason; (b) a long delay has been
configured for applying the WAL (we don't yet have such a feature for
logical replication, but a discussion of it is in progress).

Both reasons mostly apply to logical replication because there is no
separate walreceiver process whose job is to just flush the WAL. In
logical replication, the process that receives the WAL also applies
it. So, while applying, it can get stuck for a long time waiting for a
heavy-weight lock to be released by some other long-running
transaction in another backend. Similarly, if the user has configured
a large apply delay, the network buffer between the walsender and the
receiver process can fill up.

The condition to allow the shutdown to wait for all WAL to be sent has
two parts: (a) it confirms that there is no pending WAL to be sent;
(b) it confirms all the WAL sent has been flushed by the client. As
per our understanding, both these conditions are to allow clean
switchover/failover which seems to be useful only for physical
replication. Logical replication doesn't provide such functionality.
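To make the two parts of the condition concrete, here is a minimal
sketch (not the actual PostgreSQL code; the function and parameter
names are illustrative) of the exit test the walsender performs during
shutdown, modeled on the check in WalSndDone() in walsender.c:

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned long XLogRecPtr;   /* stand-in for PostgreSQL's LSN type */

/*
 * Hypothetical model of the shutdown test: the walsender may exit only
 * once (a) all WAL has been sent with nothing pending in the output
 * buffer, and (b) the client has confirmed flushing everything sent.
 */
static bool
walsender_may_exit_current(bool caught_up, bool send_pending,
                           XLogRecPtr sent_lsn, XLogRecPtr flushed_lsn)
{
    return caught_up && !send_pending       /* condition (a) */
        && sent_lsn == flushed_lsn;         /* condition (b) */
}
```

With this shape, a subscriber that stops confirming flushes (stuck on a
lock, or delaying apply) keeps flushed_lsn behind sent_lsn forever, and
the shutdown never completes.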

The proposed patch tries to eliminate condition (b) for logical
replication in the hopes that the same will allow the shutdown to be
complete in most cases. No specific reason has been discussed for not
also relaxing condition (a) for logical replication.
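The patched behavior could then look roughly like the following sketch
(again hypothetical names, not the actual patch): condition (a) stays
for everyone, while condition (b), the remote-flush confirmation, is
skipped when the walsender serves logical replication:

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned long XLogRecPtr;   /* stand-in for PostgreSQL's LSN type */

/*
 * Hypothetical sketch of the proposed relaxation: keep condition (a)
 * unconditionally, but waive condition (b) for logical replication,
 * where no clean-switchover guarantee is needed.
 */
static bool
walsender_may_exit_patched(bool is_logical, bool caught_up,
                           bool send_pending,
                           XLogRecPtr sent_lsn, XLogRecPtr flushed_lsn)
{
    /* condition (a): every WAL record has been handed to the socket */
    if (!caught_up || send_pending)
        return false;

    /* condition (b): skipped for logical replication */
    if (is_logical)
        return true;

    return sent_lsn == flushed_lsn;
}
```

Under this sketch a logical walsender exits as soon as its output
buffer drains, even if the subscriber never confirms the flush, while a
physical walsender keeps the full switchover guarantee.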

Now, to proceed here we have the following options: (1) Fix (b) as
proposed by the patch and document the risks related to (a); (2) Fix
both (a) and (b); (3) Do nothing and document that users need to
unblock the subscribers to complete the shutdown.

Thoughts?

--
With Regards,
Amit Kapila.
