Re: Disallow quorum uncommitted (with synchronous standbys) txns in logical replication subscribers

From: SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Disallow quorum uncommitted (with synchronous standbys) txns in logical replication subscribers
Date: 2022-01-07 17:44:15
Message-ID: CAHg+QDdfjQS6c8JAcV5z+ZfdGCYYm4xDvEFKHGkToGJdT8pd9w@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 7, 2022 at 12:27 AM Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
wrote:

> At Thu, 6 Jan 2022 23:55:01 -0800, SATYANARAYANA NARLAPURAM <
> satyanarlapuram(at)gmail(dot)com> wrote in
> > On Thu, Jan 6, 2022 at 11:24 PM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> >
> > > On Wed, 2022-01-05 at 23:59 -0800, SATYANARAYANA NARLAPURAM wrote:
> > > > I would like to propose a GUC send_Wal_after_quorum_committed which
> > > > when set to ON, walsenders corresponding to async standbys and logical
> > > > replication workers wait until the LSN is quorum committed on the
> > > > primary before sending it to the standby. This not only simplifies
> > > > the post failover steps but avoids unnecessary downtime for the async
> > > > replicas. Thoughts?
> > >
> > > Do we need a GUC? Or should we just always require that sync rep is
> > > satisfied before sending to async replicas?
> > >
> >
> > I proposed a GUC to not introduce a behavior change by default. I have no
> > strong opinion on having a GUC or making the proposed behavior default,
> > would love to get others' perspectives as well.
> >
> >
> > >
> > > It feels like the sync quorum should always be ahead of the async
> > > replicas. Unless I'm missing a use case, or there is some kind of
> > > performance gotcha.
> > >
> >
> > I couldn't think of a case that can cause serious performance issues but
> > will run some experiments on this and post the numbers.
>
> I think Jeff is saying that "quorum commit" already by definition
> means that all out-of-quorum standbys are behind of the
> quorum-standbys. I agree to that in a dictionary sense. But I can
> think of the case where the response from the top-runner standby
> vanishes or gets caught somewhere on network for some reason. In that
> case the primary happily checks quorum ignoring the top-runner.
>
> To avoid that misdecision, I can guess two possible "solutions".
>
> One is to serialize WAL sending (of course it is unacceptable at all)
> or another is to send WAL to all standbys at once then make the
> decision after making sure receiving replies from all standbys (this
> is no longer quorum commit in another sense..)
>

There is no need to serialize sending the WAL among sync standbys. The only
serialization required is to send first to all the sync replicas and then to
the async replicas, if any. Once an LSN is quorum committed, no failover
subsystem initiates an automatic failover such that the LSN is lost (data loss).
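
To make the ordering concrete, here is a rough sketch in Python. It is not
PostgreSQL code: the function names, the wait_for_quorum flag (standing in for
the proposed GUC), and the use of plain integers as LSNs are all assumptions
made for illustration. The idea is that walsenders for sync standbys stream
freely, while a walsender for an async standby or logical subscriber sends no
further than the quorum-committed LSN.

```python
def quorum_committed_lsn(sync_flush_lsns, quorum_size):
    """Highest LSN flushed by at least quorum_size sync standbys.

    sync_flush_lsns holds the latest flush LSN reported by each sync
    standby. Returns 0 when fewer than quorum_size standbys have replied.
    """
    if quorum_size <= 0:
        raise ValueError("quorum_size must be positive")
    if len(sync_flush_lsns) < quorum_size:
        return 0
    # The quorum_size-th largest reported LSN is covered by a quorum.
    return sorted(sync_flush_lsns, reverse=True)[quorum_size - 1]


def async_send_limit(primary_flush_lsn, sync_flush_lsns, quorum_size,
                     wait_for_quorum=True):
    """Upper bound on the LSN an async/logical walsender may send.

    wait_for_quorum is a hypothetical stand-in for the proposed GUC.
    When it is off, the sender may go up to the primary's flush LSN.
    """
    if not wait_for_quorum:
        return primary_flush_lsn
    return min(primary_flush_lsn,
               quorum_committed_lsn(sync_flush_lsns, quorum_size))
```

With sync standbys at LSNs 300, 100 and 200 and a quorum of 2, the
quorum-committed LSN is 200, so an async standby would be held at 200 even if
the primary has flushed up to 350.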

>
> So I'm afraid that there's no sensible solution to avoid the
> hiding-forerunner problem on quorum commit.
>

Could you elaborate on the problem here?

>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
>
