Re: Synchronizing slots from primary to standby

From: Ajin Cherian <itsajin(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: Re: Synchronizing slots from primary to standby
Date: 2024-03-08 04:26:35
Message-ID: CAFPTHDaC6mQECXQUPUoMXkxPo+23Gwx7LeHvtdmuXKSWCMTQgw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 8, 2024 at 2:33 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:

> On Thu, Mar 7, 2024 at 12:00 PM Zhijie Hou (Fujitsu)
> <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> >
> >
> > Attach the V108 patch set which addressed above and Peter's comments.
> > I also removed the check for "*" in guc check hook.
> >
>
>
> Pushed with minor modifications. I'll keep an eye on BF.
>
> BTW, one thing that we should try to evaluate a bit more is the
> traversal of slots in StandbySlotsHaveCaughtup() where we verify if
> all the slots mentioned in standby_slot_names have received the
> required WAL. Even if the standby_slot_names list is short the total
> number of slots can be much larger which can lead to an increase in
> CPU usage during traversal. There is an optimization that allows to
> cache ss_oldest_flush_lsn and ensures that we don't need to traverse
> the slots each time so it may not hit frequently but still there is a
> chance. I see it is possible to further optimize this area by caching
> the position of each slot mentioned in standby_slot_names in
> replication_slots array but not sure whether it is worth.
>
>
>
I tried to test this by configuring a large number of logical slots while
making sure the standby slots are at the end of the array and checking if
there was any performance hit in logical replication from these searches.

Setup:
1. 1 primary server configured with 3 servers in standby_slot_names, 1
extra logical slot (not configured for failover) + 1 logical subscriber
configured for failover + 3 physical standbys (all configured to sync
logical slots)

2. 1 primary server configured with 3 servers in standby_slot_names,
100 extra logical slots (not configured for failover) + 1 logical
subscriber configured for failover + 3 physical standbys (all configured
to sync logical slots)

3. 1 primary server configured with 3 servers in standby_slot_names,
500 extra logical slots (not configured for failover) + 1 logical
subscriber configured for failover + 3 physical standbys (all configured
to sync logical slots)

In the three setups, the 3 entries in standby_slot_names are compared with
a list of 2, 101, and 501 slots respectively.
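
To give a feel for how much extra work the worst case adds, here is a toy
model of the search. It is only a sketch, not PostgreSQL code: ToySlot,
toy_find_slot() and toy_count_comparisons() are hypothetical names, and it
assumes each name in standby_slot_names is looked up by a linear scan of
the slot array, with the standby slots placed after all the logical slots.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for one slot array entry (NOT the real ReplicationSlot). */
typedef struct ToySlot
{
    const char *name;
} ToySlot;

/*
 * Linear search for a slot by name. Every name comparison performed is
 * added to *comparisons. Returns the index of the slot, or -1 if absent.
 */
static int
toy_find_slot(const ToySlot *slots, size_t nslots, const char *name,
              size_t *comparisons)
{
    for (size_t i = 0; i < nslots; i++)
    {
        (*comparisons)++;
        if (strcmp(slots[i].name, name) == 0)
            return (int) i;
    }
    return -1;
}

/*
 * Look up every wanted name once, as one full check would in this model,
 * and return the total number of name comparisons it cost.
 */
static size_t
toy_count_comparisons(const ToySlot *slots, size_t nslots,
                      const char *const *wanted, size_t nwanted)
{
    size_t comparisons = 0;

    for (size_t w = 0; w < nwanted; w++)
    {
        int idx = toy_find_slot(slots, nslots, wanted[w], &comparisons);

        assert(idx >= 0);   /* the model expects every name to exist */
        (void) idx;         /* silence unused warning under NDEBUG */
    }
    return comparisons;
}
```

Under these assumptions, 501 logical slots followed by the 3 standby slots
cost 502 + 503 + 504 = 1509 string comparisons per full check, versus
3 + 4 + 5 = 12 with only 2 logical slots. That is a small amount of work
per check, and the ss_oldest_flush_lsn cache mentioned above avoids even
that on many calls.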

I ran pgbench for 15 minutes on each of the 3 setups:

Case 1: Average TPS - 8.143399 TPS
Case 2: Average TPS - 8.187462 TPS
Case 3: Average TPS - 8.190611 TPS

I see no degradation in performance; the differences between the cases are
well within the run-to-run variation seen.

Nisha also did some performance tests to record the lag introduced by
traversing a large number of slots in StandbySlotsHaveCaughtup(). The tests
logged the time at the start and end of each XLogSendLogical() call (which
eventually calls WalSndWaitForWal() --> StandbySlotsHaveCaughtup()) and
calculated the total time taken by this function during the load run for
different total slot counts.
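
For anyone who wants to repeat this kind of measurement, the idea of
accumulating the time spent between the start and end of a call can be
sketched as below. This is not the instrumentation Nisha actually used,
which is not shown here; CallTimer and its helper functions are
hypothetical names, and inside the server the helpers in instr_time.h
would probably be the more natural choice.

```c
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime() */

#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Accumulates the wall-clock time spent inside a bracketed region. */
typedef struct CallTimer
{
    int64_t  total_ns;   /* sum of all measured intervals */
    int64_t  start_ns;   /* timestamp taken by the last begin call */
    uint64_t calls;      /* number of completed begin/end pairs */
} CallTimer;

/* Current monotonic clock reading in nanoseconds. */
static int64_t
now_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

    assert(rc == 0);
    (void) rc;
    return (int64_t) ts.tv_sec * 1000000000LL + (int64_t) ts.tv_nsec;
}

/* Call right before the function being measured. */
static void
call_timer_begin(CallTimer *t)
{
    t->start_ns = now_ns();
}

/* Call right after the function being measured returns. */
static void
call_timer_end(CallTimer *t)
{
    t->total_ns += now_ns() - t->start_ns;
    t->calls++;
}

/* Total accumulated time in seconds, e.g. for logging at shutdown. */
static double
call_timer_total_secs(const CallTimer *t)
{
    return (double) t->total_ns / 1e9;
}
```

Wrapping the call in call_timer_begin()/call_timer_end() and logging
call_timer_total_secs() at the end of the run gives one total per run,
which is the kind of figure reported for each run in this test.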

Setup:
--one primary with 3 standbys and one subscriber with one active
subscription
--hot_standby_feedback=off and sync_replication_slots=false
--made sure the standby slots remain at the end of the
ReplicationSlotCtl->replication_slots array, to measure the worst-case
scenario for the standby slot search in StandbySlotsHaveCaughtup()

pgbench was run for 15 minutes. Here is the data (total time spent in
XLogSendLogical() during the run):

Case 1: with 1 logical slot, standby_slot_names having 3 slots
Run 1: 626.141642 secs
Run 2: 631.930254 secs

Case 2: with 100 logical slots, standby_slot_names having 3 slots
Run 1: 629.38332 secs
Run 2: 630.548432 secs

Case 3: with 500 logical slots, standby_slot_names having 3 slots
Run 1: 629.910829 secs
Run 2: 627.924183 secs

No degradation in performance was seen.
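
As a quick sanity check on that conclusion, the gap between the per-case
averages can be compared with the spread between two runs of the same
case. The sketch below does just that arithmetic on the figures above;
RunPair and the helper names are made up for illustration, and two runs
per case is of course too few for any real statistics.

```c
#include <assert.h>
#include <stddef.h>

/* The two measured totals (in seconds) for one test case. */
typedef struct RunPair
{
    double run1;
    double run2;
} RunPair;

/* The figures reported above, in seconds (cases 1, 2 and 3). */
static const RunPair reported_cases[] = {
    {626.141642, 631.930254},
    {629.38332, 630.548432},
    {629.910829, 627.924183},
};

static double
pair_mean(RunPair p)
{
    return (p.run1 + p.run2) / 2.0;
}

/* Absolute difference between the two runs of one case. */
static double
pair_spread(RunPair p)
{
    double d = p.run1 - p.run2;

    return d < 0 ? -d : d;
}

/* Largest difference between any two per-case means. */
static double
max_mean_gap(const RunPair *cases, size_t n)
{
    assert(n > 0);

    double lo = pair_mean(cases[0]);
    double hi = lo;

    for (size_t i = 1; i < n; i++)
    {
        double m = pair_mean(cases[i]);

        if (m < lo)
            lo = m;
        if (m > hi)
            hi = m;
    }
    return hi - lo;
}

/* Largest run-to-run spread seen within any single case. */
static double
max_spread(const RunPair *cases, size_t n)
{
    double worst = 0.0;

    for (size_t i = 0; i < n; i++)
    {
        double s = pair_spread(cases[i]);

        if (s > worst)
            worst = s;
    }
    return worst;
}
```

For the reported figures, the per-case means are about 629.04, 629.97 and
628.92 secs, so the largest gap between cases is about 1.05 secs, while
the two runs of Case 1 alone differ by about 5.79 secs. The difference
between slot counts is smaller than the noise between identical runs.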

Thanks Nisha for helping with the testing.

regards,
Ajin Cherian
Fujitsu Australia
