Re: Exit walsender before confirming remote flush in logical replication

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "dilipbalaut(at)gmail(dot)com" <dilipbalaut(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exit walsender before confirming remote flush in logical replication
Date: 2023-02-02 05:17:55
Message-ID: CAD21AoBfzHHf2JRftOp8tn8yx0P38=bcaQQCnE2CJjfPGGMhnA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 1, 2023 at 6:28 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Wed, Feb 1, 2023 at 2:09 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Fri, Jan 20, 2023 at 7:45 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Tue, Jan 17, 2023 at 2:41 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > >
> > > > Let me try to summarize the discussion till now. The problem we are
> > > > trying to solve here is to allow a shutdown to complete when walsender
> > > > is not able to send the entire WAL. Currently, in such cases, the
> > > > shutdown fails. As per our current understanding, this can happen when
> > > > (a) walreceiver/walapply process is stuck (not able to receive more
> > > > WAL) due to locks or some other reason; (b) a long time delay has been
> > > > configured to apply the WAL (we don't yet have such a feature for
> > > > logical replication but the discussion for same is in progress).
> > > >
> > > > Both reasons mostly apply to logical replication because there is no
> > > > separate walreceiver process whose job is to just flush the WAL. In
> > > > logical replication, the process that receives the WAL also applies
> > > > it. So, while applying, it can get stuck for a long time waiting for some
> > > > heavy-weight lock to be released by some other long-running
> > > > transaction by the backend.
> > > >
> ...
> ...
> >
> > +1 to eliminate condition (b) for logical replication.
> >
> > Regarding (a), as Amit mentioned before[1], I think we should check if
> > pq_is_send_pending() is false.
> >
>
> Sorry, but your suggestion is not completely clear to me. Do you mean
> to say that for logical replication, we shouldn't wait for all the WAL
> to be successfully replicated but we should ensure to inform the
> subscriber that XLOG streaming is done (by ensuring
> pq_is_send_pending() is false and by calling EndCommand, pq_flush())?

Yes.
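
To spell out what that would mean, here is a rough sketch. It is a toy
Python model, not walsender code: the WalSenderState class and both
function names are made up for illustration, and the "current" rule is
only my reading of the exit condition discussed in this thread.

```python
from dataclasses import dataclass


@dataclass
class WalSenderState:
    """Toy stand-in for walsender state; every field name is hypothetical."""
    is_logical: bool      # logical (True) or physical (False) replication
    caught_up: bool       # all available WAL has been sent
    sent_ptr: int         # last position sent to the receiver
    replicated_ptr: int   # last position the receiver has confirmed
    send_pending: bool    # models the result of pq_is_send_pending()


def may_send_done_current(s: WalSenderState) -> bool:
    """Roughly the current rule: wait until the receiver has confirmed everything."""
    return s.caught_up and s.sent_ptr == s.replicated_ptr and not s.send_pending


def may_send_done_proposed(s: WalSenderState) -> bool:
    """Proposed rule: a logical walsender waits only for its output buffer to drain."""
    if s.is_logical:
        return not s.send_pending
    return may_send_done_current(s)
```

With the proposed rule, a logical walsender whose subscriber lags behind
can still send the done message (EndCommand, pq_flush()) as soon as
nothing is pending in its output buffer; physical replication keeps the
stricter check.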

>
> > Otherwise, we will end up terminating
> > the WAL stream without the done message. Which will lead to an error
> > message "ERROR: could not receive data from WAL stream: server closed
> > the connection unexpectedly" on the subscriber even at a clean
> > shutdown.
> >
>
> But will that be a problem? As per docs of shutdown [1] ( “Smart” mode
> disallows new connections, then waits for all existing clients to
> disconnect. If the server is in hot standby, recovery and streaming
> replication will be terminated once all clients have disconnected.),
> there is no such guarantee.

In the smart shutdown case, the walsender doesn't exit until it can
flush the done message, no?

> I see that it is required for the
> switchover in physical replication to ensure that all the WAL is sent
> and replicated but we don't need that for logical replication.

It won't be a problem in practice as far as logical replication is
concerned. But I'm concerned that this error could confuse users. Is
there any case where the client gets such an error during a smart
shutdown?

>
> > In a case where pq_is_send_pending() doesn't become false
> > for a long time, (e.g., the network socket buffer got full due to the
> > apply worker waiting on a lock), I think users should unblock it by
> > themselves. Or it might be practically better to shut down the
> > subscriber first in the logical replication case, unlike the physical
> > replication case.
> >
>
> Yeah, will users like such a dependency? And what will they gain by doing so?

IIUC there is no difference between smart shutdown and fast shutdown
in the logical replication walsender, but reading the doc[1], it seems
to me that in smart shutdown mode, the server stops existing sessions
normally. For example, if the client is psql and it gets stuck for some
reason so that the network buffer fills up, the smart shutdown waits
for the backend process to send all results to the client. I think the
logical replication walsender should follow this behavior for
consistency. One idea is to distinguish between smart shutdown and fast
shutdown in the logical replication walsender as well, so that we
disconnect even without the done message in fast shutdown mode, but I'm
not sure it's worthwhile.
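
As a rough sketch of that last idea (again a toy Python model; the
ShutdownMode enum, the function, and the returned strings are all
hypothetical and do not exist in the walsender code):

```python
from enum import Enum


class ShutdownMode(Enum):
    SMART = "smart"
    FAST = "fast"


def logical_walsender_exit_action(mode: ShutdownMode, send_pending: bool) -> str:
    """Decide what a logical walsender would do once shutdown is requested.

    send_pending models the result of pq_is_send_pending().
    """
    if not send_pending:
        # Output buffer is drained: send the done message and exit cleanly.
        return "send_done_and_exit"
    if mode is ShutdownMode.FAST:
        # Fast shutdown: disconnect even though the done message cannot be sent.
        return "exit_without_done"
    # Smart shutdown: keep waiting, as a regular backend does for a stuck client.
    return "keep_waiting"
```

Only the fast-shutdown case with a full send buffer would leave the
subscriber without the done message.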

[1] https://www.postgresql.org/docs/devel/server-shutdown.html

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
