Re: Conflict detection for update_deleted in logical replication

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Subject: Re: Conflict detection for update_deleted in logical replication
Date: 2025-07-06 14:50:42
Message-ID: CAD21AoBW5oO_PZv9xnz_HQnWQkgFE=Hu_aN2=CtXqbKbSZEWQw@mail.gmail.com
Lists: pgsql-hackers

On Sun, Jul 6, 2025 at 8:03 PM Hayato Kuroda (Fujitsu)
<kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
>
> Dear hackers,
>
> For confirmation, I ran performance tests with the four workloads
> we used before.

Thank you for doing the performance tests!

>
> 03. pgbench on both sides
> ========================
> The workload is mostly the same as [3].
>
> Workload:
> - Ran pgbench with 40 clients on *both sides*.
> - The duration was 120s, and the measurement was repeated 10 times.
>
> (bothtest.tar.gz can run the same workload)
>
> Test Scenarios & Results:
> Publisher:
> - pgHead : Median TPS = 16799.67659
> - pgHead + patch : Median TPS = 17338.38423
> Subscriber:
> - pgHead : Median TPS = 16552.60515
> - pgHead + patch : Median TPS = 8367.133693

My first impression is that 40 clients is a small number for a 50%
performance degradation to occur within 120s. Did you test how many
clients are required to trigger the same level of performance
regression with retain_conflict_info = off?
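
For reference, the 50% figure follows directly from the median TPS values quoted above. A minimal sketch of that arithmetic; the dictionary layout and the helper name `relative_change` are only illustrative and not part of any test script:

```python
# Median TPS values quoted above for workload 03 (40 clients on both sides).
median_tps = {
    "publisher": {"pgHead": 16799.67659, "pgHead + patch": 17338.38423},
    "subscriber": {"pgHead": 16552.60515, "pgHead + patch": 8367.133693},
}


def relative_change(base: float, patched: float) -> float:
    """Return the change of `patched` relative to `base`, as a fraction."""
    return (patched - base) / base


changes = {
    node: relative_change(tps["pgHead"], tps["pgHead + patch"])
    for node, tps in median_tps.items()
}

for node, change in changes.items():
    print(f"{node}: {change:+.1%}")
# publisher: +3.2%
# subscriber: -49.5%
```

So the publisher is roughly unchanged while the subscriber loses about half of its throughput.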

>
> 04. pgbench on both sides, and max_conflict_retention_duration was tuned
> ========================================================================
> The workload is mostly the same as [4].
>
> Workload:
> - Initially ran pgbench with 40 clients on *both sides*.
> - Set max_conflict_retention_duration = {60, 120}
> - When the slot is invalidated on the subscriber side, stop the benchmark and
> wait until the subscriber has caught up. Then halve the number of clients on
> the publisher.
> In this test the conflict slot was invalidated as expected when the workload
> on the publisher was high, and it was no longer invalidated after
> reducing the workload. This shows that even if the slot has been invalidated
> once, users can continue to detect the update_deleted conflict by reducing the
> workload on the publisher.
> - Total period of the test was 900s for each case.
>
> (max_conflixt.tar.gz can run the same workload)
>
> Observation:
> - Parallelism on the publisher side was reduced 15->7->3, and finally the
> conflict slot was no longer invalidated.
> - TPS on the subscriber side improved when the concurrency was reduced.
> This is because dead tuple accumulation on the subscriber is reduced due to
> the reduced workload on the publisher.
> - When the publisher has Nclients=3, there is no regression in the
> subscriber's TPS.

I think that users typically cannot control the amount of workload in
production, meaning that once the performance regression starts to
happen, the subscriber could enter a loop of invalidating the slot,
recovering the performance, re-creating the slot, and hitting the
performance problem again.
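
To illustrate the kind of loop I mean, here is a toy model. It is only a sketch under made-up assumptions: it does not model PostgreSQL internals, and the function `simulate` and the parameters `lag_growth` and `catchup_rate` are hypothetical knobs, not real settings. The point is only that with a constant workload the cycle repeats indefinitely.

```python
def simulate(total_s: int, max_retention_s: int,
             lag_growth: float, catchup_rate: float) -> list[tuple[int, str]]:
    """Toy model returning (time, event) pairs for slot invalidation/re-creation.

    lag_growth   -- seconds of lag gained per second while retention is on (hypothetical)
    catchup_rate -- seconds of lag drained per second while retention is off (hypothetical)
    """
    events = []
    lag = 0.0
    retaining = True
    for t in range(1, total_s + 1):
        if retaining:
            # Retention is on: the subscriber falls behind under constant load.
            lag += lag_growth
            if lag > max_retention_s:
                retaining = False
                events.append((t, "invalidated"))
        else:
            # Retention is off: performance recovers and the lag drains.
            lag = max(0.0, lag - catchup_rate)
            if lag == 0.0:
                retaining = True
                events.append((t, "re-created"))
    return events


# Workload never changes during the 900s run, as would be typical in production.
events = simulate(total_s=900, max_retention_s=60, lag_growth=0.5, catchup_rate=1.0)
invalidations = [t for t, name in events if name == "invalidated"]
print(invalidations)
```

With these made-up numbers the slot is invalidated repeatedly at a fixed interval for as long as the workload stays the same.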

>
> Detailed Results Table:
> For max_conflict_retention_duration = 60s
> On publisher:
> Nclients  duration [s]  TPS
> 15        72            14079.1
> 7         82            9307
> 3         446           4133.2
>
> On subscriber:
> Nclients  duration [s]  TPS
> 15        72            6827
> 15        81            7200
> 15        446           19129.4
>
>
> For max_conflict_retention_duration = 120s
> On publisher:
> Nclients  duration [s]  TPS
> 15        162           17835.3
> 7         152           9503.8
> 3         283           4243.9
>
>
> On subscriber:
> Nclients  duration [s]  TPS
> 15        162           4571.8
> 15        152           4707
> 15        283           19568.4

What does each duration mean in these results? For the
max_conflict_retention_duration=120s test case, can we interpret the
results as saying that with 7 clients on the publisher and 15 clients
on the subscriber, the TPS on the subscriber was about one fourth
(17835.3 vs. 4707)?
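
A minimal sketch of the arithmetic behind the "one fourth" figure. It assumes that the rows of the publisher and subscriber tables pair up in order (same phase of the test), which is part of what the question above is asking. The variable names are only illustrative.

```python
# TPS values copied from the max_conflict_retention_duration = 120s tables.
# Both dicts are keyed by the publisher's Nclients in that phase (assumption:
# the subscriber rows correspond to the publisher rows in the same order).
publisher_tps = {15: 17835.3, 7: 9503.8, 3: 4243.9}
subscriber_tps = {15: 4571.8, 7: 4707.0, 3: 19568.4}

# Ratio as quoted in the question: subscriber TPS in the 7-client phase
# against the 17835.3 figure.
ratio_quoted = subscriber_tps[7] / publisher_tps[15]

# Alternative baseline: subscriber TPS in the phase reported as regression-free
# (publisher Nclients = 3).
ratio_vs_recovered = subscriber_tps[7] / subscriber_tps[3]

print(f"{ratio_quoted:.2f}")        # 0.26
print(f"{ratio_vs_recovered:.2f}")  # 0.24
```

Either baseline gives roughly one fourth, so the interpretation does not depend much on which one is used.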

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
