Re: Synchronizing slots from primary to standby

From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: Synchronizing slots from primary to standby
Date: 2023-08-01 11:22:13
Message-ID: CAJpy0uBC9NbLsG5sezp-wQ0=SW3OUZML6XwYSBH5LVgZArxMyQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jul 27, 2023 at 12:13 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
>
> On Thu, Jul 27, 2023 at 10:55 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Wed, Jul 26, 2023 at 10:31 AM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> > >
> > > On Mon, Jul 24, 2023 at 9:00 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > >
> > > > On Mon, Jul 24, 2023 at 8:03 AM Bharath Rupireddy
> > > > <bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
> > > > >
> > > > > Is having one (or a few more - not
> > > > > necessarily one for each logical slot) worker for all logical slots
> > > > > enough?
> > > > >
> > > >
> > > > I guess for a large number of slots there is a possibility of a
> > > > large gap in syncing the slots, which probably means we need to
> > > > retain the corresponding WAL for a much longer time on the primary.
> > > > If we can prove that the gap won't be large enough to matter then
> > > > this would probably be worth considering; otherwise, I think we
> > > > should find a way to scale the number of workers to avoid the large
> > > > gap.
> > > >
> > >
> > > How about this:
> > >
> > > 1) On standby, spawn 1 worker per database in the start (as it is
> > > doing currently).
> > >
> > > 2) Maintain statistics on standby, by any means, about activity
> > > against each of the primary's databases. This could be done by
> > > maintaining 'last_synced_time' and 'last_activity_seen_time'. The
> > > 'last_synced_time' is updated every time we sync/recheck slots for
> > > that particular database. The 'last_activity_seen_time' changes only
> > > if we find a slot on that database whose confirmed_flush or
> > > restart_lsn has changed from the value already maintained.
> > >
> > > 3) If at any moment we find that 'last_synced_time' -
> > > 'last_activity_seen_time' goes beyond a threshold, that means the DB
> > > is not active currently. Add it to the list of inactive DBs.
> > >
> >
> > I think we should also increase the next_sync_time if there is no
> > update in the current sync.
>
> +1
>
> >
> > > 4) The launcher, on the other hand, keeps checking whether it needs
> > > to spawn an extra worker for any new DB. It will additionally check
> > > whether the number of inactive databases (maintained on standby) has
> > > exceeded some threshold; if so, it brings down the workers for those
> > > and starts a common worker which takes care of all such inactive
> > > databases (or merges all into one), while workers for active
> > > databases remain as they are (i.e. one per DB). Each worker
> > > maintains the list of DBs which it is responsible for.
> > >
> > > 5) If in the list of these inactive databases, we again find any
> > > active database using the above logic, then the launcher will spawn a
> > > separate worker for that.
> > >
> >
> > I wonder if we anyway need some sort of design like this because we
> > shouldn't allow spawning as many workers as the number of databases.
> > There has to be some existing or new GUC like max_sync_slot_workers
> > which decides the number of workers.
> >
>
> Currently it does not have any such GUC for sync-slot workers. It
> mainly uses the logical-rep-worker framework for the sync-slot worker
> part and thus it relies on 'max_logical_replication_workers' GUC. Also
> it errors out if 'max_replication_slots' is set to zero. I think it is
> not the correct way of doing things for sync-slot. We can have a new
> GUC (max_sync_slot_workers) as you suggested and if the number of
> databases < max_sync_slot_workers, then we can start 1 worker per
> dbid, else divide the work equally among the max sync-workers
> possible. And for inactive database cases, we can increase the
> next_sync_time rather than starting a special worker to handle all the
> inactive databases. Thoughts?
>
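
To make the next_sync_time idea above concrete, here is a rough sketch
in Python (not code from the patch). It assumes each database keeps the
last seen (restart_lsn, confirmed_flush) pair per slot and doubles its
nap time whenever a sync finds no change. The names DbSyncState and
sync_once, and the MIN/MAX values, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Assumed bounds, chosen only for illustration.
MIN_NAPTIME_MS = 10
MAX_NAPTIME_MS = 10_000

# slot name -> (restart_lsn, confirmed_flush), kept as opaque strings
SlotPositions = Dict[str, Tuple[str, str]]


@dataclass
class DbSyncState:
    """Hypothetical per-database bookkeeping on the standby."""
    naptime_ms: int = MIN_NAPTIME_MS
    next_sync_time_ms: int = 0
    last_seen: SlotPositions = field(default_factory=dict)


def sync_once(state: DbSyncState, now_ms: int,
              remote_slots: SlotPositions) -> bool:
    """Record one sync attempt and schedule the next one.

    Returns True if any slot position changed since the last sync.
    """
    changed = remote_slots != state.last_seen
    state.last_seen = dict(remote_slots)
    if changed:
        # Activity seen: go back to the shortest interval.
        state.naptime_ms = MIN_NAPTIME_MS
    else:
        # No update in this sync: push next_sync_time further out.
        state.naptime_ms = min(state.naptime_ms * 2, MAX_NAPTIME_MS)
    state.next_sync_time_ms = now_ms + state.naptime_ms
    return changed
```

With this, an idle database is rechecked less and less often (up to the
assumed cap), and returns to the short interval as soon as one of its
slots moves.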

Attaching the PoC patch (0003), which attempts to implement the basic
infrastructure for the suggested design. I have rebased the existing
patches (0001 and 0002) as well.

This patch adds a new GUC, max_slot_sync_workers; its default and
maximum values are kept at 2 and 50 respectively for this PoC patch.
The replication launcher now divides the work equally among that many
slot-sync workers. Say there are multiple slots on the primary
belonging to 10 DBs and the new GUC on the standby is set to its
default value of 2; then each worker on the standby will manage 5 DBs
and will keep syncing the slots for them. If the replication launcher
finds a new DB, it will assign it to the worker currently handling the
fewest DBs (or to the first such worker in case of a tie), and that
worker will pick up the new DB the next time it tries to sync the
slots.

I have kept these changes in a separate patch (0003) for ease of
review. Since this is just a PoC patch, many things are yet to be done
properly; I will cover those in the next versions.
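
As a rough illustration of the assignment rule described above (a
Python sketch, not the C code in the patch), the launcher's choice can
be modelled as "pick the worker with the fewest DBs, first one wins on
a tie". The names SlotSyncWorker, assign_db and launch are hypothetical,
and the sketch assumes all workers are started up front.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SlotSyncWorker:
    """Hypothetical stand-in for one slot-sync worker on the standby."""
    worker_id: int
    dbids: List[int] = field(default_factory=list)


def assign_db(workers: List[SlotSyncWorker], dbid: int) -> SlotSyncWorker:
    """Assign dbid to the worker currently handling the fewest DBs."""
    for w in workers:
        if dbid in w.dbids:
            return w  # already managed by some worker, nothing to do
    # min() returns the first minimal element, so on equal counts the
    # first worker is chosen.
    target = min(workers, key=lambda w: len(w.dbids))
    target.dbids.append(dbid)
    return target


def launch(max_slot_sync_workers: int,
           dbids: List[int]) -> List[SlotSyncWorker]:
    """Spread the given DBs across max_slot_sync_workers workers."""
    workers = [SlotSyncWorker(i) for i in range(max_slot_sync_workers)]
    for dbid in dbids:
        assign_db(workers, dbid)
    return workers


# 10 DBs with the default of 2 workers: each worker manages 5 DBs.
workers = launch(2, list(range(1, 11)))
```

A DB that shows up later goes through the same assign_db() call, so
with both workers at 5 DBs it lands on the first worker.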

thanks
Shveta

Attachment                                                       Content-Type              Size
v10-0001-Allow-logical-walsenders-to-wait-for-physical-st.patch  application/octet-stream  21.1 KB
v10-0003-max_slot_sync_workers-GUC-based-implementation.patch    application/octet-stream  33.9 KB
v10-0002-Add-logical-slot-sync-capability-to-physical-sta.patch  application/octet-stream  53.4 KB
