Re: Synchronizing slots from primary to standby

From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: Synchronizing slots from primary to standby
Date: 2023-07-27 06:43:44
Message-ID: CAJpy0uCT+npL4eUvCWiV_MBEri9ixcUgJVDdsBCJSqLd0oD1fQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jul 27, 2023 at 10:55 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Wed, Jul 26, 2023 at 10:31 AM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> >
> > On Mon, Jul 24, 2023 at 9:00 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Mon, Jul 24, 2023 at 8:03 AM Bharath Rupireddy
> > > <bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
> > > >
> > > > Is having one (or a few more - not
> > > > necessarily one for each logical slot) worker for all logical slots
> > > > enough?
> > > >
> > >
> > > I guess for a large number of slots the is a possibility of a large
> > > gap in syncing the slots which probably means we need to retain
> > > corresponding WAL for a much longer time on the primary. If we can
> > > prove that the gap won't be large enough to matter then this would be
> > > probably worth considering otherwise, I think we should find a way to
> > > scale the number of workers to avoid the large gap.
> > >
> >
> > How about this:
> >
> > 1) On standby, spawn 1 worker per database in the start (as it is
> > doing currently).
> >
> > 2) Maintain statistics on activity against each primary's database on
> > standby by any means. Could be by maintaining 'last_synced_time' and
> > 'last_activity_seen time'. The last_synced_time is updated every time
> > we sync/recheck slots for that particular database. The
> > 'last_activity_seen_time' changes only if we get any slot on that
> > database where actually confirmed_flush or say restart_lsn has changed
> > from what was maintained already.
> >
> > 3) If at any moment, we find that 'last_synced_time' -
> > 'last_activity_seen' goes beyond a threshold, that means that DB is
> > not active currently. Add it to list of inactive DB
> >
>
> I think we should also increase the next_sync_time if in current sync,
> there is no update.

+1

>
> > 4) Launcher on the other hand is always checking if it needs to spawn
> > any other extra worker for any new DB. It will additionally check if
> > number of inactive databases (maintained on standby) has gone higher
> > (> some threshold), then it brings down the workers for those and
> > starts a common worker which takes care of all such inactive databases
> > (or merge all in 1), while workers for active databases remain as such
> > (i.e. one per db). Each worker maintains the list of DBs which it is
> > responsible for.
> >
> > 5) If in the list of these inactive databases, we again find any
> > active database using the above logic, then the launcher will spawn a
> > separate worker for that.
> >
>
> I wonder if we anyway some sort of design like this because we
> shouldn't allow to spawn as many workers as the number of databases.
> There has to be some existing or new GUC like max_sync_slot_workers
> which decided the number of workers.
>

Currently it does not have any such GUC for sync-slot workers. It
mainly uses the logical-rep-worker framework for the sync-slot worker
part and thus it relies on 'max_logical_replication_workers' GUC. Also
it errors out if 'max_replication_slots' is set to zero. I think it is
not the correct way of doing things for sync-slot. We can have a new
GUC (max_sync_slot_workers) as you suggested and if the number of
databases < max_sync_slot_workers, then we can start 1 worker per
dbid, else divide the work equally among the max sync-workers
possible. And for inactive database cases, we can increase the
next_sync_time rather than starting a special worker to handle all the
inactive databases. Thoughts?

thanks
Shveta

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2023-07-27 06:54:19 Re: Synchronizing slots from primary to standby
Previous Message Amit Kapila 2023-07-27 05:24:57 Re: Synchronizing slots from primary to standby