RE: Synchronizing slots from primary to standby

From: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To: "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Ajin Cherian <itsajin(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: RE: Synchronizing slots from primary to standby
Date: 2023-11-29 09:11:30
Message-ID: OS0PR01MB57163425CEE6CEC16C7D6FA59483A@OS0PR01MB5716.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tuesday, November 28, 2023 11:58 PM Drouvot, Bertrand <bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
>
> Hi,
>
> On 11/28/23 10:40 AM, shveta malik wrote:
> > On Tue, Nov 28, 2023 at 12:19 PM Drouvot, Bertrand
> > <bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
> >>
> >> Hi,
> >>
> >> On 11/28/23 4:13 AM, shveta malik wrote:
> >>> On Mon, Nov 27, 2023 at 4:08 PM Amit Kapila
> <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >>>>
> >>>> On Mon, Nov 27, 2023 at 2:27 PM Zhijie Hou (Fujitsu)
> >>>> <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> >>>>>
> >>>>> Here is the updated version(v39_2) which include all the changes made
> in 0002.
> >>>>> Please use for review, and sorry for the confusion.
> >>>>>
> >>>>
> >>>> --- a/src/backend/replication/logical/launcher.c
> >>>> +++ b/src/backend/replication/logical/launcher.c
> >>>> @@ -8,20 +8,27 @@
> >>>> * src/backend/replication/logical/launcher.c
> >>>> *
> >>>> * NOTES
> >>>> - * This module contains the logical replication worker launcher which
> >>>> - * uses the background worker infrastructure to start the logical
> >>>> - * replication workers for every enabled subscription.
> >>>> + * This module contains the replication worker launcher which
> >>>> + * uses the background worker infrastructure to:
> >>>> + * a) start the logical replication workers for every enabled
> subscription
> >>>> + * when not in standby_mode.
> >>>> + * b) start the slot sync worker for logical failover slots
> synchronization
> >>>> + * from the primary server when in standby_mode.
> >>>>
> >>>> I was wondering do we really need a launcher on standby to invoke
> >>>> sync-slot worker. If so, why? I guess it may be required for
> >>>> previous versions where we were managing work for multiple
> >>>> slot-sync workers which is also questionable in the sense of
> >>>> whether launcher is the right candidate for the same but now with
> >>>> the single slot-sync worker, it doesn't seem worth having it. What do you
> think?
> >>>>
> >>>> --
> >>>
> >>> Yes, earlier a manager process was needed to manage multiple
> >>> slot-sync workers and distribute load among them, but now that does
> >>> not seem necessary. I gave it a try (PoC) and it seems to work well.
> >>> If there are no objections to this approach, I can share the patch soon.
> >>>
> >>
> >> +1 on this new approach, thanks!
> >
> > PFA v40. This patch has removed Logical Replication Launcher support
> > to launch slotsync worker.
>
> Thanks!
>
> > The slot-sync worker is now registered as bgworker with postmaster,
> > with bgw_start_time=BgWorkerStart_ConsistentState and
> > bgw_restart_time=60sec.
> >
> > On removal of launcher, now all the validity checks have been shifted
> > to slot-sync worker itself. This brings us to some point of concerns:
> >
> > a) We still need to maintain RecoveryInProgress() check in slotsync
> > worker. Since worker has the start time of
> > BgWorkerStart_ConsistentState, it will be started on non-standby as
> > well. So to ensure that it exists on non-standby, "RecoveryInProgress"
> > has been introduced at the beginning of the worker. But once it exits,
> > postmaster will not restart it since it will be clean-exist i.e.
> > proc_exit(0) (the restart logic of postmaster comes into play only
> > when there is an abnormal exit). But to exit for the first time on
> > non-standby, we need that Recovery related check in worker.
> >
> > b) "enable_syncslot" check is moved to slotsync worker now. Since
> > enable_syncslot is PGC_SIGHUP, so proc_exit(1) is currently used to
> > exit the worker if 'enable_syncslot' is found to be disabled.
> > 'proc_exit(1)' has been used in order to ensure that the worker is
> > restarted and GUCs are checked again after restart_time. Downside of
> > this approach is, if someone has kept "enable_syncslot" as disabled
> > permanently even on standby, slotsync worker will keep on restarting
> > and exiting.
> >
> > So to overcome the above pain-points, I think a potential approach
> > will be to start slotsync worker only if 'enable_syncslot' is on and
> > the system is non-standby.
>
> That makes sense to me.
>
> > Potential ways (each with some issues) are:
> >
> > 1) Use the current way i.e. register slot-sync worker as bgworker with
> > postmaster, but introduce extra checks in 'maybe_start_bgworkers'. But
> > this seems more like a hack. This will need extra changes as currently
> > once 'maybe_start_bgworkers' is attempted by postmaster, it will
> > attempt again to start any worker only if the worker had abnormal exit
> > and restart_time !=0. The current postmatser will not attempt to start
> > worker on any GUC change.
> >
> > 2) Another way maybe to treat slotsync worker as special case and
> > separate out the start/restart of slotsync worker from bgworker, and
> > follow what we do for autovacuum launcher(StartAutoVacLauncher) to
> > keep starting it in the postmaster loop(ServerLoop). In this way, we
> > may be able to add more checks before starting worker. But by opting
> > this approach, we will have to manage slotsync worker completely by
> > ourself as it will be no longer be part of existing
> > bgworker-registration infra. If this seems okay and there are no other
> > better options, it can be analyzed further in detail.
> >
> > 3) Another approach could be, in order to solve issue (a), introduce a
> > new start_time 'BgWorkerStart_ConsistentState_HotStandby' which means
> > start a bgworker only if consistent state is reached and the system is
> > standby. And for issue (b), lets retain check of enable_syncslot in
> > the worker itself but make it 'PGC_POSTMASTER'. This will ensure we
> > can safely exit the worker(proc_exit(0) if enable_syncslot is disabled
> > and postmaster will not restart it. But I'm not sure if making it
> > "PGC_POSTMASTER" is acceptable from the user's perspective.
>
> I had the same idea (means make enable_syncslot as 'PGC_POSTMASTER')
> when reading b). I'm +1 on it (at least for V1) as I don't think that this parameter
> value would change frequently. Curious to know what others think too.
>
> Then as far a) is concerned, I'd vote for introducing a new
> BgWorkerStart_ConsistentState_HotStandby.

Here is the V41 patch set which includes the following changes.

V41-0001:
1) Based on the discussion[1], I update the document to remind user to
change the slot's failover option when ALTER SUBSCRIPTION ... SET
(slot_name = xx).

2) Address comments in [2][3][4].

V41-0002:
1) 'enable_syncslot' is changed from PGC_SIGHUP to PGC_POSTMASTER,
slot-sync worker will now clean exit (proc_exit(0)) if enable_syncslot is
found disabled.

2) BgWorkerStart_ConsistentState_HotStandby is introduced as new
start-time for bgworker. This will start worker only if it is standby_mode
and consistent state is reached.

3) 'SYNCSLOT_STATE_INITIATED' is now set in 'ReplicationSlotCreate' itself
in slot-sync worker case. Earlier it was set at later point of time giving
a window wherein even a synced slot was in 'n' state for quite some time,
which was not correct.

Thanks Shveta for working on the V41-0002.

[1] https://www.postgresql.org/message-id/CAA4eK1Jd9dk%3D5POTKM9p4EyYqYzLXe-AnLzHrUELjzZScLz7mw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/eb09f682-db82-41cd-93bc-5d44e10e1d6d%40gmail.com
[3] https://www.postgresql.org/message-id/CAHut%2BPsuSWjm7U_sVnL8FXZ7ZQcfCcT44kAK7i6qMG35Cwjy3A%40mail.gmail.com
[4] https://www.postgresql.org/message-id/CAFPTHDbFqLgXS6Et%2BshNGPDjCKK66C%2BZSarqFHmQvfnAah3Qsw%40mail.gmail.com

Best Regards,
Hou zj

Attachment Content-Type Size
v41-0003-Allow-slot-sync-worker-to-wait-for-the-cascading.patch application/octet-stream 7.8 KB
v41-0001-Allow-logical-walsenders-to-wait-for-the-physica.patch application/octet-stream 136.0 KB
v41-0002-Add-logical-slot-sync-capability-to-the-physical.patch application/octet-stream 79.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Zhijie Hou (Fujitsu) 2023-11-29 09:12:54 RE: Synchronizing slots from primary to standby
Previous Message Masahiko Sawada 2023-11-29 08:19:11 Re: BackgroundPsql's set_query_timer_restart() may not work