Re: Allow async standbys wait for sync replication

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Allow async standbys wait for sync replication
Date: 2022-03-15 07:38:12
Message-ID: CALj2ACWCj60g6TzYMbEO07ZhnBGbdCveCrD413udqbRM0O59RA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 9, 2022 at 7:31 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On 2022-03-06 12:27:52 +0530, Bharath Rupireddy wrote:
> > On Sun, Mar 6, 2022 at 1:57 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > >
> > > Hi,
> > >
> > > On 2022-03-05 14:14:54 +0530, Bharath Rupireddy wrote:
> > > > I understand. Even if we use the SyncRepWaitForLSN approach, the async
> > > > walsenders will have to do nothing in WalSndLoop() until the sync
> > > > walsender wakes them up via SyncRepWakeQueue.
> > >
> > > I still think we should flat out reject this approach. The proper way to
> > > implement this feature is to change the protocol so that WAL can be sent to
> > > replicas with an additional LSN informing them up to where WAL can be
> > > flushed. That way WAL is already sent when the sync replicas have acknowledged
> > > receipt and just an updated "flush/apply up to here" LSN has to be sent.
> >
> > I was having this thought at the back of my mind. Please help me understand these:
> > 1) How will the async standbys ignore the WAL received but
> > not-yet-flushed by them in case the sync standbys don't acknowledge
> > flush LSN back to the primary for whatever reasons?
>
> What do you mean with "ignore"? When replaying?

Let me illustrate with an example:

1) Say the primary is at LSN 100 and the sync standby is at LSN 90
(about to receive, or receiving, the WAL from LSN 91 - 100 from the
primary), while the async standby is already at LSN 100 - today this
is possible if the async standby happens to be closer to the primary
than the sync standby, for whatever reason.
2) With the approach originally proposed in this thread, async
standbys can never get ahead of LSN 90 (the flush LSN reported back to
the primary by all sync standbys).
3) With the approach suggested here, i.e. "let async standbys receive
WAL at their own pace, but only allow them to apply/write/flush WAL to
the pg_wal directory/disk up to the sync standbys' latest flush LSN",
async standbys can receive the WAL from LSN 91 - 100 but aren't
allowed to apply/write/flush it. Where will the async standbys hold
the WAL from LSN 91 - 100 until the latest flush LSN (100) is reported
to them? If they "somehow" store the WAL from LSN 91 - 100 without
applying/writing/flushing it, how will they discard that WAL if the
sync standbys don't report the latest flush LSN back to the primary
(for whatever reason)? In that case the primary has no idea of the
sync standbys' latest flush LSN, say if the sync standbys never come
back up, reconnect and resync with the primary. Should the async
standbys always treat the WAL from LSN 91 - 100 as invalid because
they haven't received the sync flush LSN from the primary? If so,
aren't there "invalid holes" in the WAL files on the async standbys?
(A rough sketch of what I mean follows below.)
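
To make the concern concrete, here is a hypothetical sketch (all the
helper names below are invented, this is not actual code) of
standby-side gating where received-but-not-yet-authorized WAL stays in
memory only, so no invalid WAL ever reaches pg_wal:

/*
 * Hypothetical walreceiver-side state: WAL past the primary-reported
 * "flushable up to" LSN stays in memory and is simply thrown away on
 * disconnect, avoiding "invalid holes" in the on-disk WAL.
 */
typedef struct AsyncRecvState
{
    XLogRecPtr  receivedUpTo;    /* end of all WAL received so far */
    XLogRecPtr  flushableUpTo;   /* sync flush LSN reported by primary */
} AsyncRecvState;

static void
ProcessWalChunk(AsyncRecvState *state, char *buf, Size len,
                XLogRecPtr start)
{
    BufferWalInMemory(buf, len, start);    /* hypothetical helper */
    state->receivedUpTo = start + len;

    /*
     * Only WAL up to the sync flush LSN may be written/flushed/applied;
     * the remainder waits in memory for flushableUpTo to advance.
     */
    WriteAndFlushBufferedWalUpTo(Min(state->flushableUpTo,
                                     state->receivedUpTo));
}

The open question, as above, is what to do with the in-memory tail
when the primary can never report a newer sync flush LSN.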

> I think this'd require adding a new pg_control field saying up to which LSN
> WAL is "valid". If that field is set, replay would only replay up to that LSN
> unless some explicit operation is taken to replay further (e.g. for data
> recovery).

With the approach that's suggested, i.e. "let async standbys receive
WAL at their own pace, but only allow them to apply/write/flush WAL to
the pg_wal directory/disk up to the sync standbys' latest flush LSN",
the WAL on an async standby can have two parts - most of it "valid and
meaningful for the async standby" and some of it "invalid and not
meaningful for the async standby". Wouldn't this require reworking
parts of the system such as the redo/apply/recovery logic on async
standbys, tools like pg_basebackup, pg_rewind, pg_receivewal,
pg_recvlogical, cascading replication etc. that depend on WAL records
and would now need to know whether those records are valid for them? I
may be wrong here though.
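
For illustration, the pg_control idea might look roughly like this
(the field name is invented and this is only a sketch, not a patch):

/*
 * Hypothetical pg_control field bounding how far a standby may replay.
 * InvalidXLogRecPtr would mean "no limit", i.e. today's behaviour.
 */
/* added to ControlFileData in pg_control.h: */
XLogRecPtr  replayValidUpTo;

/* checked in the recovery loop before replaying each record,
 * record_end_lsn being the end LSN of the record about to be replayed: */
if (ControlFile->replayValidUpTo != InvalidXLogRecPtr &&
    record_end_lsn > ControlFile->replayValidUpTo)
    break;    /* stop unless an explicit operation asks to replay further */

All the tools listed above would then need to learn to honour (or at
least report) that bound, which is the rework I'm worried about.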

> > 2) When we say the async standbys will receive the WAL, will they just
> > keep the received WAL in the shared memory but not apply or will they
> > just write but not apply the WAL and flush the WAL to the pg_wal
> > directory on the disk or will they write to some other temp wal
> > directory until they receive go-ahead LSN from the primary?
>
> I was thinking that for now it'd go to disk, but eventually would first go to
> wal_buffers and only to disk if wal_buffers needs to be flushed out (and only
> in that case the pg_control field would need to be set).

IIUC, the WAL buffers (XLogCtl->pages) aren't used on standbys, as the
walreceivers bypass them and flush the received data directly to disk.
Hence, the WAL buffers that are allocated (I haven't checked the code
though) but unused on standbys could be used to hold the WAL until the
new flush LSN is reported from the primary. At any point in time, the
WAL buffers would hold the latest WAL that's waiting for a new flush
LSN from the primary. However, this can be a problem for larger
transactions that eat up the entire WAL buffers while the flush LSN is
far behind, in which case we would need to flush the WAL to the latest
WAL file in pg_wal/disk but let the other folks in the server know up
to where that WAL is valid.
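
That spill path might look roughly like this (again, the names are
invented; only UpdateControlFile() is an existing routine):

/*
 * Hypothetical overflow path: a large transaction fills the standby's
 * WAL buffers before the sync flush LSN advances, so spill everything
 * to pg_wal but record how far the on-disk WAL is actually valid.
 */
if (WalBuffersFull())                      /* hypothetical */
{
    FlushBufferedWalToDisk();              /* hypothetical: write + fsync */
    ControlFile->replayValidUpTo = flushableUpTo;  /* field sketched above */
    UpdateControlFile();
}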

> > 3) Won't the network transfer cost be wasted in case the sync standbys
> > don't acknowledge flush LSN back to the primary for whatever reasons?
>
> That should be *extremely* rare, and in that case a bit of wasted traffic
> isn't going to matter.

Agree.

> > The proposed idea in this thread (async standbys waiting for flush LSN
> > from sync standbys before sending the WAL), although it makes async
> > standby slower in receiving the WAL, it doesn't have the above
> > problems and is simpler to implement IMO. Since this feature is going
> > to be optional with a GUC, users can enable it based on the needs.
>
> To me it's architecturally the completely wrong direction. We should move in
> the *other* direction, i.e. allow WAL to be sent to standbys before the
> primary has finished flushing it locally. Which requires similar
> infrastructure to what we're discussing here.

Agree.

There's also this existing comment in walsender.c:

 * XXX probably this should be improved to suck data directly from the
 * WAL buffers when possible.

Like others pointed out, if the above were done, it would be possible
to achieve "allow WAL to be sent to standbys before the primary has
finished flushing it locally".
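
A sketch of what that fast path might look like in XLogSendPhysical()
(the buffer-reading helper is invented; WALRead() is the existing
fallback for WAL already evicted from the buffers):

/*
 * Hypothetical fast path: copy recently-inserted WAL straight out of
 * wal_buffers, falling back to WALRead() for anything no longer there.
 */
nread = XLogReadFromBuffers(&output_message.data[output_message.len],
                            startptr, nbytes, tli);    /* hypothetical */
if (nread < nbytes)
    WALRead(xlogreader,
            &output_message.data[output_message.len + nread],
            startptr + nread, nbytes - nread, tli, &errinfo);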

I would like to hear more thoughts and then summarize the design
points a bit later.

Regards,
Bharath Rupireddy.
