Re: Synchronizing slots from primary to standby

From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Ajin Cherian <itsajin(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: Synchronizing slots from primary to standby
Date: 2023-11-28 09:44:08
Message-ID: CAJpy0uBw1sYgimV4o_JYbwapwCJO+FuZw1C5TtquTecPOfPxsw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Nov 28, 2023 at 3:10 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
>
> On Tue, Nov 28, 2023 at 12:19 PM Drouvot, Bertrand
> <bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
> >
> > Hi,
> >
> > On 11/28/23 4:13 AM, shveta malik wrote:
> > > On Mon, Nov 27, 2023 at 4:08 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >>
> > >> On Mon, Nov 27, 2023 at 2:27 PM Zhijie Hou (Fujitsu)
> > >> <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> > >>>
> > >>> Here is the updated version(v39_2) which include all the changes made in 0002.
> > >>> Please use for review, and sorry for the confusion.
> > >>>
> > >>
> > >> --- a/src/backend/replication/logical/launcher.c
> > >> +++ b/src/backend/replication/logical/launcher.c
> > >> @@ -8,20 +8,27 @@
> > >> * src/backend/replication/logical/launcher.c
> > >> *
> > >> * NOTES
> > >> - * This module contains the logical replication worker launcher which
> > >> - * uses the background worker infrastructure to start the logical
> > >> - * replication workers for every enabled subscription.
> > >> + * This module contains the replication worker launcher which
> > >> + * uses the background worker infrastructure to:
> > >> + * a) start the logical replication workers for every enabled subscription
> > >> + * when not in standby_mode.
> > >> + * b) start the slot sync worker for logical failover slots synchronization
> > >> + * from the primary server when in standby_mode.
> > >>
> > >> I was wondering do we really need a launcher on standby to invoke
> > >> sync-slot worker. If so, why? I guess it may be required for previous
> > >> versions where we were managing work for multiple slot-sync workers
> > >> which is also questionable in the sense of whether launcher is the
> > >> right candidate for the same but now with the single slot-sync worker,
> > >> it doesn't seem worth having it. What do you think?
> > >>
> > >> --
> > >
> > > Yes, earlier a manager process was needed to manage multiple slot-sync
> > > workers and distribute load among them, but now that does not seem
> > > necessary. I gave it a try (PoC) and it seems to work well. If there
> > > are no objections to this approach, I can share the patch soon.
> > >
> >
> > +1 on this new approach, thanks!
>
> PFA v40. This patch has removed Logical Replication Launcher support
> to launch slotsync worker. The slot-sync worker is now registered as
> bgworker with postmaster, with
> bgw_start_time=BgWorkerStart_ConsistentState and
> bgw_restart_time=60sec.
>
> On removal of launcher, now all the validity checks have been shifted
> to slot-sync worker itself. This brings us to some point of concerns:
>
> a) We still need to maintain RecoveryInProgress() check in slotsync
> worker. Since worker has the start time of
> BgWorkerStart_ConsistentState, it will be started on non-standby as
> well. So to ensure that it exists on non-standby, "RecoveryInProgress"
> has been introduced at the beginning of the worker. But once it exits,
> postmaster will not restart it since it will be clean-exist i.e.
> proc_exit(0) (the restart logic of postmaster comes into play only
> when there is an abnormal exit). But to exit for the first time on
> non-standby, we need that Recovery related check in worker.
>
> b) "enable_syncslot" check is moved to slotsync worker now. Since
> enable_syncslot is PGC_SIGHUP, so proc_exit(1) is currently used to
> exit the worker if 'enable_syncslot' is found to be disabled.
> 'proc_exit(1)' has been used in order to ensure that the worker is
> restarted and GUCs are checked again after restart_time. Downside of
> this approach is, if someone has kept "enable_syncslot" as disabled
> permanently even on standby, slotsync worker will keep on restarting
> and exiting.
>
> So to overcome the above pain-points, I think a potential approach
> will be to start slotsync worker only if 'enable_syncslot' is on and
> the system is non-standby. Potential ways (each with some issues) are:
>

Correction here: start slotsync worker only if 'enable_syncslot' is
on and the system is standby.

> 1) Use the current way i.e. register slot-sync worker as bgworker with
> postmaster, but introduce extra checks in 'maybe_start_bgworkers'. But
> this seems more like a hack. This will need extra changes as currently
> once 'maybe_start_bgworkers' is attempted by postmaster, it will
> attempt again to start any worker only if the worker had abnormal exit
> and restart_time !=0. The current postmatser will not attempt to start
> worker on any GUC change.
>
> 2) Another way maybe to treat slotsync worker as special case and
> separate out the start/restart of slotsync worker from bgworker, and
> follow what we do for autovacuum launcher(StartAutoVacLauncher) to
> keep starting it in the postmaster loop(ServerLoop). In this way, we
> may be able to add more checks before starting worker. But by opting
> this approach, we will have to manage slotsync worker completely by
> ourself as it will be no longer be part of existing
> bgworker-registration infra. If this seems okay and there are no other
> better options, it can be analyzed further in detail.
>
> 3) Another approach could be, in order to solve issue (a), introduce a
> new start_time 'BgWorkerStart_ConsistentState_HotStandby' which means
> start a bgworker only if consistent state is reached and the system is
> standby. And for issue (b), lets retain check of enable_syncslot in
> the worker itself but make it 'PGC_POSTMASTER'. This will ensure we
> can safely exit the worker(proc_exit(0) if enable_syncslot is disabled
> and postmaster will not restart it. But I'm not sure if making it
> "PGC_POSTMASTER" is acceptable from the user's perspective.
>
> thanks
> Shveta

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2023-11-28 09:50:25 Re: [PoC] pg_upgrade: allow to upgrade publisher node
Previous Message shveta malik 2023-11-28 09:40:21 Re: Synchronizing slots from primary to standby