Re: Allow async standbys wait for sync replication (was: Disallow quorum uncommitted (with synchronous standbys) txns in logical replication subscribers)

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Allow async standbys wait for sync replication (was: Disallow quorum uncommitted (with synchronous standbys) txns in logical replication subscribers)
Date: 2022-02-26 08:47:50
Message-ID: CALj2ACUbo5euJAd6cgmNGi5Q7Og2Da07VqE_TQj8KepmjM9UfA@mail.gmail.com
Lists: pgsql-hackers

On Sat, Feb 26, 2022 at 1:08 AM Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote:
>
> On Fri, Feb 25, 2022 at 08:31:37PM +0530, Bharath Rupireddy wrote:
> > Thanks Satya and others for the inputs. Here's the v1 patch that
> > basically allows async wal senders to wait until the sync standbys
> > report their flush lsn back to the primary. Please let me know your
> > thoughts.
>
> I haven't had a chance to look too closely yet, but IIUC this adds a new
> function that waits for synchronous replication. This new function
> essentially spins until the synchronous LSN has advanced.
>
> I don't think it's a good idea to block sending any WAL like this. AFAICT
> it is possible that there will be a lot of synchronously replicated WAL
> that we can send, and it might just be the last several bytes that cannot
> yet be replicated to the asynchronous standbys. I believe this patch will
> cause the server to avoid sending _any_ WAL until the synchronous LSN
> advances.
>
> Perhaps we should instead just choose the SendRqstPtr based on the current
> synchronous LSN. Presumably there are other things we'd need to consider,
> but in general, I think we ought to send as much WAL as possible for a
> given call to XLogSendPhysical().

One alternative is to compute a global min LSN of SendRqstPtr across all
the sync standbys and let the walsenders for the async standbys send WAL
only up to that LSN. This differs from the v1 patch, where async standbys
wait until the sync standbys report their flush LSN back to the primary.
The problem with the global min LSN approach is that it still leaves a
small window in which async standbys can get ahead of sync standbys.
Imagine the async standbys are closer to the primary than the sync
standbys, and a failover happens after the async standbys have received
the WAL up to SendRqstPtr but before the sync standbys have. We would
then hit the same problem that this patch is trying to solve. This is why
v1 waits for the sync flush LSN: it gives stronger protection for
transactions.
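
To make the difference between the two approaches concrete, here is a rough standalone sketch, not code from the v1 patch. The struct and function names are made up for illustration, and it assumes for simplicity that every listed sync standby must confirm (no quorum or priority handling).

```c
#include <stddef.h>
#include <stdint.h>

/* PostgreSQL represents an LSN as a 64-bit unsigned integer. */
typedef uint64_t XLogRecPtr;

/* Hypothetical per-sync-standby positions as seen by the primary. */
typedef struct SyncStandbyPos
{
    XLogRecPtr send_rqst;   /* how far WAL has been requested for sending */
    XLogRecPtr flush;       /* flush LSN the standby has reported back */
} SyncStandbyPos;

/*
 * Global min LSN approach: cap what an async walsender may send at the
 * lowest send position among the sync standbys.
 */
XLogRecPtr
async_limit_by_min_send(const SyncStandbyPos *sync, size_t n,
                        XLogRecPtr send_rqst_ptr)
{
    XLogRecPtr limit = send_rqst_ptr;

    for (size_t i = 0; i < n; i++)
        if (sync[i].send_rqst < limit)
            limit = sync[i].send_rqst;
    return limit;
}

/*
 * Sync flush LSN approach: cap what an async walsender may send at the
 * lowest flush position the sync standbys have reported back.
 */
XLogRecPtr
async_limit_by_min_flush(const SyncStandbyPos *sync, size_t n,
                         XLogRecPtr send_rqst_ptr)
{
    XLogRecPtr limit = send_rqst_ptr;

    for (size_t i = 0; i < n; i++)
        if (sync[i].flush < limit)
            limit = sync[i].flush;
    return limit;
}
```

For example, with two sync standbys at send positions 1000 and 1200 and reported flush positions 800 and 900, the first function lets an async walsender go up to 1000 while the second stops at 800. The range between 800 and 1000 is the window described above.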

Do you think it would make sense to let async standbys optionally wait
for remote write, flush, apply, or the global min LSN of SendRqstPtr, so
that users can choose what they want?

> > I've done pgbench testing to see if the patch causes any problems. I
> > ran tests two times, there isn't much difference in the txns per
> > seconds (tps), although there's a delay in the async standby receiving
> > the WAL, after all, that's the feature we are pursuing.
>
> I'm curious what a longer pgbench run looks like when the synchronous
> replicas are in the same region. That is probably a more realistic
> use-case.

We are running more tests and will share the results once they are done.

Regards,
Bharath Rupireddy.
