Re: Synchronizing slots from primary to standby

From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Ajin Cherian <itsajin(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: Synchronizing slots from primary to standby
Date: 2023-08-17 10:39:54
Message-ID: CAJpy0uDJ8rV1LWgPOXwcV9Xb1ESyB_0RJ6Ksbkc4FW4JgXXUcg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Aug 17, 2023 at 11:55 AM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
>
> On Thu, Aug 17, 2023 at 11:44 AM Drouvot, Bertrand
> <bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
> >
> > Hi,
> >
> > On 8/14/23 11:52 AM, shveta malik wrote:
> >
> > >
> > > We (myself and Ajin) performed the tests to compute the lag in standby
> > > slots as compared to primary slots with different number of slot-sync
> > > workers configured.
> > >
> >
> > Thanks!
> >
> > > 3 DBs were created, each with 30 tables and each table having one
> > > logical-pub/sub configured. So this made a total of 90 logical
> > > replication slots to be synced. Then the workload was run for aprox 10
> > > mins. During this workload, at regular intervals, primary and standby
> > > slots' lsns were captured (from pg_replication_slots) and compared. At
> > > each capture, the intent was to know how much is each standby's slot
> > > lagging behind corresponding primary's slot by taking the distance
> > > between confirmed_flush_lsn of primary and standby slot. Then we took
> > > the average (integer value) of this distance over the span of 10 min
> > > workload
> >
> > Thanks for the explanations, make sense to me.
> >
> > > and this is what we got:
> > >
> > > With max_slot_sync_workers=1, average-lag = 42290.3563
> > > With max_slot_sync_workers=2, average-lag = 24585.1421
> > > With max_slot_sync_workers=3, average-lag = 14964.9215
> > >
> > > This shows that more workers have better chances to keep logical
> > > replication slots in sync for this case.
> > >
> >
> > Agree.
> >
> > > Another statistics if it interests you is, we ran a frequency test as
> > > well (this by changing code, unit test sort of) to figure out the
> > > 'total number of times synchronization done' with different number of
> > > sync-slots workers configured. Same 3 DBs setup with each DB having 30
> > > logical replication slots. With 'max_slot_sync_workers' set at 1, 2
> > > and 3; total number of times synchronization done was 15874, 20205 and
> > > 23414 respectively. Note: this is not on the same machine where we
> > > captured lsn-gap data, it is on a little less efficient machine but
> > > gives almost the same picture
> > >
> > > Next we are planning to capture this data for a lesser number of slots
> > > like 10,30,50 etc. It may happen that the benefit of multi-workers
> > > over single workers in such cases could be less, but let's have the
> > > data to verify that.
> > >
> >
> > Thanks a lot for those numbers and for the testing!
> >
> > Do you think it would make sense to also get the number of using
> > the pg_failover_slots module? (and compare the pg_failover_slots numbers with the
> > "one worker" case here). Idea is to check if the patch does introduce
> > some overhead as compare to pg_failover_slots.
> >
>
> Yes, definitely. We will work on that and share the numbers soon.
>

We are working on these tests. Meanwhile attaching the patches which
attempt to implement below functionalities:

1) Remove extra slots on standby if those no longer exist on primary
or are no longer part of synchronize_slot_names.
2) Make synchronize_slot_names user-modifiable. And due to change in
'synchronize_slot_names', if dbids list is reduced, then take care of
removal of extra/old db-ids (if any) from workers db-list.

Thanks Ajin for working on 1. Both the above changes are in
patch-0002. There is a test failure in the recovery module due to
these new changes, I am looking into it and will fix it in the next
version.

Improvements in pipeline:
a) standby slots should not be consumable.
b) optimization of query which standby sends to primary. Currently it
has dbid filter and slot-name filter. Slot-name filter can be
optimized to have only those slots which belong to DBs assigned to the
worker rather than having all 'synchronize_slot_names'.
c) analyze if the naptime of the slot-sync worker can be auto-tuned.
If there is no activity going on (i.e. slots are not advancing on
primary) then increase naptime of slot-sync worker on standby and
decrease it again when activity starts.

thanks
Shveta

Attachment Content-Type Size
v12-0001-Allow-logical-walsenders-to-wait-for-physical-st.patch application/octet-stream 21.1 KB
v12-0002-Add-logical-slot-sync-capability-to-physical-sta.patch application/octet-stream 71.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nazir Bilal Yavuz 2023-08-17 11:18:58 Re: meson: Non-feature feature options
Previous Message a.rybakina 2023-08-17 10:20:33 Re: POC, WIP: OR-clause support for indexes