| From: | Alexander Korotkov <aekorotkov(at)gmail(dot)com> |
|---|---|
| To: | Grigoriy Novikov <grigoriy(dot)novikov220(at)gmail(dot)com> |
| Cc: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
| Subject: | Re: [PATCH] Add cascade synchronous replication |
| Date: | 2026-06-15 22:24:35 |
| Message-ID: | CAPpHfdtECD9Zqkg7Q7yAz4V6Qa3uDCTWqmLM3J4BRhC4x04X8w@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi Grigory!
On Mon, Dec 22, 2025 at 11:06 AM Grigoriy Novikov <
grigoriy(dot)novikov220(at)gmail(dot)com> wrote:
> My apologies for not fixing the code indentation according to the
> requirements right away. I'm attaching the corrected version of the patch.
> On 11/12/25 20:05, Григорий Новиков wrote:
> > Hello hackers,
> >
> > Introduction
> > Using a large number of synchronous standbys creates excessive load on
> > the primary node. To solve this problem, cascading synchronous
> > replication can be used.
> >
> > Overview of Changes
> > This patch adds synchronous cascading replication mechanics to
> > PostgreSQL. With it, standby servers will consider configuration
> > parameters related to synchronous replication. They will select
> > walsenders LSN positions from walsdender data structures and compute
> > the synchronous LSN position for write, flush, and apply among them
> > using the synchronous replication algorithm, then calculate the
> > minimum value between these values and the corresponding positions of
> > the standby server. To avoid synchronization problems and unnecessary
> > overhead, these calculations are performed by the walreceiver process.
> > The offset positions will be transmitted in the standby reply message
> > instead of the server's own positions. This will occur if the
> > SyncRepRequested condition is met and if at least one synchronous
> > standby server is specified in synchronous_standby_names.
> > In case the walsender processes fail to calculate synchronous LSN
> > values (for example, because there are not enough synchronous
> > standbys), the server will send DefaultSendingLSN. This value is
> > between InvalidXLogRecPtr and FirstNormalUnloggedLSN. Sending
> > InvalidXLogRecPtr is not allowed because in the pg_stat_replication
> > function, a standby sending such value will be displayed as
> > asynchronous, although it is not. The value 2 was chosen for
> > DefaultSendingLSN since 1 is used by one of the access methods.
> > When receiving a DefaultSendingLSN position value from a synchronous
> > standby, the server will use it as a regular LSN. This allows
> > transaction execution to continue if the configuration permits it. If
> > not, transaction execution stops until the cluster failure is resolved.
> >
> > Overview of Individual Patch Parts
> > The first part adds the SyncRepGetSendingSyncRecPtr function, which is
> > written similarly to SyncRepGetSyncRecPtr and is responsible for
> > calculating the LSN positions to be sent. These functions contained a
> > large common code section, which was moved to the
> > SyncRepGetSyncRecPtrBySyncRepMethod function. Also, for optimization
> > purposes, the walsender process serving a synchronous standby can call
> > the WalRcvForceReply function.
> > The second part of the patch is responsible for redistributing code in
> > the syncrep.c file into sections. This is necessary to preserve the
> > semantics of the sections used in this file, since now some functions
> > can be used by the walreceiver process, while others can be used by
> > both walreceiver and walsender.
> > The third part adds a special notation in pg_stat_replication for
> > standbys sending DefaultSendingLSN. If such a standby is synchronous,
> > it is marked with a "?" symbol. In the author's opinion, this notation
> > can simplify problem searching in the cluster, but does not claim to
> > be a serious solution for failure detection.
> > The fourth part of the patch contains fixes in recovery tests numbered
> > 9 and 12. These tests created circular dependencies between servers.
> > This was not a problem as long as standby ignored synchronous
> > replication parameters, but with this patch the tests broke. Also,
> > tests for the new mechanics were added to test 7, which is responsible
> > for synchronous replication.
> >
> > Possible Topologies
> > As part of the patch, connection of asynchronous and synchronous
> > standbys to a synchronous standby is allowed. However, offset
> > positions sent by asynchronous standbys will not be considered, since
> > the synchronous replication algorithm is used. For the same reason,
> > connecting a synchronous standby to an asynchronous one is
> > theoretically possible but meaningless.
> >
> > Additional Information
> > The patch contains no platform-dependent elements, compiles with the
> > -Wall flag, and successfully passes tests. Performance optimization is
> > a separate task, and in the author's opinion, deserves a separate
> > patch. Nevertheless, local testing using Docker containers showed
> > insignificant performance degradation when using synchronous cascading
> > chains.
Thank you for your working on this important subject. I have the following
notes for this patchset.
1) New behavior is enabled by default and can't be turned off. Some setups
could have value of synchronous_standby_names GUC on their standbys
pre-defined to the case of failover / switchover. I think we need a GUC to
control new behavior, and it must be off by default (at least in the
initial release).
2) ProcessStandbyReplyMessage() uses WalRcvRequestApplyReply() to promote
changes upper to the replication tree. That's not fully semantically
correct, because WalRcvRequestApplyReply() is designed to be called when we
need to reply the apply lsn. Practically that leads XLogWalRcvSendReply()
to be called with checkApply = true and possibly extra
GetXLogReplayRecPtr() call.
3) Check for lsns chages is broken in GetXLogReplayRecPtr().
/*
* We can compare the write and flush positions to the last message we
* sent without taking any lock, but the apply position requires a spin
* lock, so we don't check that unless it is expected to advance since
the
* previous update, i.e., when 'checkApply' is true.
*/
if (!force && now < wakeup[WALRCV_WAKEUP_REPLY])
{
if (checkApply)
latestApplyPtr = GetXLogReplayRecPtr(NULL);
if (writePtr == LogstreamResult.Write
&& flushPtr == LogstreamResult.Flush
&& (!checkApply || applyPtr == latestApplyPtr))
return;
}
Here writePtr, flushPtr, and applyPtr are latest sent values,
while LogstreamResult.Write, LogstreamResult.Flush, and latestApplyPtr are
values to be sent. But with your patchset is not true anymore. This could
cause extra replies to be sent.
4) DefaultSendingLSN value is set to 2, probably because the only other
value taken seems to be GistBuildLSN 1. However, it worth re-checking no
extension has taken other values. Probably it's more safe to choose
something close to FirstNormalUnloggedLSN. Or even better to deal without
magic value, but use separate flag instead. When DefaultSendingLSN is
passed by, it's used in Min() comparison as a normal value. So, if we pass
DefaultSendingLSN, we may loose even the LSN we previously confirmed.
5) New undocumented values 'sync?' and 'quorum?' in
pg_stat_replication.sync_state. If we have more possible statuses, they
must be clearly specified and documented.
------
Regards,
Alexander Korotkov
Supabase
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Sami Imseih | 2026-06-15 22:55:35 | Re: [PATCH] COPY TO FORMAT json: respect column list order |
| Previous Message | Michael Paquier | 2026-06-15 22:15:51 | Re: pg_restore handles extended statistics inconsistently with statistics data |