Re: Synchronizing slots from primary to standby

From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Ajin Cherian <itsajin(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: Synchronizing slots from primary to standby
Date: 2023-11-28 09:40:21
Message-ID: CAJpy0uAa6HnPKCO8pM7xFfEmzXtXDmgRvX5ErJX8e8-mWyvXYQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Nov 28, 2023 at 12:19 PM Drouvot, Bertrand
<bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
>
> Hi,
>
> On 11/28/23 4:13 AM, shveta malik wrote:
> > On Mon, Nov 27, 2023 at 4:08 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >>
> >> On Mon, Nov 27, 2023 at 2:27 PM Zhijie Hou (Fujitsu)
> >> <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> >>>
> >>> Here is the updated version(v39_2) which include all the changes made in 0002.
> >>> Please use for review, and sorry for the confusion.
> >>>
> >>
> >> --- a/src/backend/replication/logical/launcher.c
> >> +++ b/src/backend/replication/logical/launcher.c
> >> @@ -8,20 +8,27 @@
> >> * src/backend/replication/logical/launcher.c
> >> *
> >> * NOTES
> >> - * This module contains the logical replication worker launcher which
> >> - * uses the background worker infrastructure to start the logical
> >> - * replication workers for every enabled subscription.
> >> + * This module contains the replication worker launcher which
> >> + * uses the background worker infrastructure to:
> >> + * a) start the logical replication workers for every enabled subscription
> >> + * when not in standby_mode.
> >> + * b) start the slot sync worker for logical failover slots synchronization
> >> + * from the primary server when in standby_mode.
> >>
> >> I was wondering do we really need a launcher on standby to invoke
> >> sync-slot worker. If so, why? I guess it may be required for previous
> >> versions where we were managing work for multiple slot-sync workers
> >> which is also questionable in the sense of whether launcher is the
> >> right candidate for the same but now with the single slot-sync worker,
> >> it doesn't seem worth having it. What do you think?
> >>
> >> --
> >
> > Yes, earlier a manager process was needed to manage multiple slot-sync
> > workers and distribute load among them, but now that does not seem
> > necessary. I gave it a try (PoC) and it seems to work well. If there
> > are no objections to this approach, I can share the patch soon.
> >
>
> +1 on this new approach, thanks!

PFA v40. This patch has removed Logical Replication Launcher support
to launch slotsync worker. The slot-sync worker is now registered as
bgworker with postmaster, with
bgw_start_time=BgWorkerStart_ConsistentState and
bgw_restart_time=60sec.

On removal of launcher, now all the validity checks have been shifted
to slot-sync worker itself. This brings us to some point of concerns:

a) We still need to maintain RecoveryInProgress() check in slotsync
worker. Since worker has the start time of
BgWorkerStart_ConsistentState, it will be started on non-standby as
well. So to ensure that it exists on non-standby, "RecoveryInProgress"
has been introduced at the beginning of the worker. But once it exits,
postmaster will not restart it since it will be clean-exist i.e.
proc_exit(0) (the restart logic of postmaster comes into play only
when there is an abnormal exit). But to exit for the first time on
non-standby, we need that Recovery related check in worker.

b) "enable_syncslot" check is moved to slotsync worker now. Since
enable_syncslot is PGC_SIGHUP, so proc_exit(1) is currently used to
exit the worker if 'enable_syncslot' is found to be disabled.
'proc_exit(1)' has been used in order to ensure that the worker is
restarted and GUCs are checked again after restart_time. Downside of
this approach is, if someone has kept "enable_syncslot" as disabled
permanently even on standby, slotsync worker will keep on restarting
and exiting.

So to overcome the above pain-points, I think a potential approach
will be to start slotsync worker only if 'enable_syncslot' is on and
the system is non-standby. Potential ways (each with some issues) are:

1) Use the current way i.e. register slot-sync worker as bgworker with
postmaster, but introduce extra checks in 'maybe_start_bgworkers'. But
this seems more like a hack. This will need extra changes as currently
once 'maybe_start_bgworkers' is attempted by postmaster, it will
attempt again to start any worker only if the worker had abnormal exit
and restart_time !=0. The current postmatser will not attempt to start
worker on any GUC change.

2) Another way maybe to treat slotsync worker as special case and
separate out the start/restart of slotsync worker from bgworker, and
follow what we do for autovacuum launcher(StartAutoVacLauncher) to
keep starting it in the postmaster loop(ServerLoop). In this way, we
may be able to add more checks before starting worker. But by opting
this approach, we will have to manage slotsync worker completely by
ourself as it will be no longer be part of existing
bgworker-registration infra. If this seems okay and there are no other
better options, it can be analyzed further in detail.

3) Another approach could be, in order to solve issue (a), introduce a
new start_time 'BgWorkerStart_ConsistentState_HotStandby' which means
start a bgworker only if consistent state is reached and the system is
standby. And for issue (b), lets retain check of enable_syncslot in
the worker itself but make it 'PGC_POSTMASTER'. This will ensure we
can safely exit the worker(proc_exit(0) if enable_syncslot is disabled
and postmaster will not restart it. But I'm not sure if making it
"PGC_POSTMASTER" is acceptable from the user's perspective.

thanks
Shveta

Attachment Content-Type Size
v40-0002-Add-logical-slot-sync-capability-to-the-physical.patch application/octet-stream 75.3 KB
v40-0003-Allow-slot-sync-worker-to-wait-for-the-cascading.patch application/octet-stream 8.5 KB
v40-0001-Allow-logical-walsenders-to-wait-for-the-physica.patch application/octet-stream 134.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message shveta malik 2023-11-28 09:44:08 Re: Synchronizing slots from primary to standby
Previous Message Daniel Gustafsson 2023-11-28 09:38:12 Re: SSL tests fail on OpenSSL v3.2.0