Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

From: Melih Mutlu <m(dot)melihmutlu(at)gmail(dot)com>
To: Peter Smith <smithpb2250(at)gmail(dot)com>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, "wangw(dot)fnst(at)fujitsu(dot)com" <wangw(dot)fnst(at)fujitsu(dot)com>, "shiy(dot)fnst(at)fujitsu(dot)com" <shiy(dot)fnst(at)fujitsu(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Date: 2023-05-25 08:59:26
Message-ID: CAGPVpCRYapb0FBYB+ZKbU+=u6CQfKBeHNeqmGAcb3y7r=upazQ@mail.gmail.com
Lists: pgsql-hackers

Hi,

Peter Smith <smithpb2250(at)gmail(dot)com> wrote on Wed, 24 May 2023 at 05:59:

> Hi, and thanks for the patch! It is an interesting idea.
>
> I have not yet fully read this thread, so below are only my first
> impressions after looking at patch 0001. Sorry if some of these were
> already discussed earlier.
>
> TBH the patch "reuse-workers" logic seemed more complicated than I had
> imagined it might be.
>

If by the "reuse-workers" patch you mean patch 0001, most of the
complexity comes from the refactoring needed to split the apply worker
and tablesync worker code paths. [1]
If you mean the whole patch set, then I believe it's because reusing
replication slots also requires taking a proper snapshot each time the
worker moves to a new table. [2]
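
To illustrate that second point, here is a rough sketch of the loop a
reused tablesync worker runs (my simplified pseudo-C, not the actual
patch code; pick_next_table_to_sync() and take_new_snapshot() are
hypothetical stand-ins):

/* Simplified sketch of the reuse loop, not the actual patch code. */
for (;;)
{
    /* Hypothetical helper: claim the next table still waiting for sync. */
    Oid         relid = pick_next_table_to_sync();

    if (!OidIsValid(relid))
        break;                  /* nothing left to sync, worker can exit */

    /*
     * The replication slot is kept across iterations, but a proper
     * snapshot has to be taken before copying each new table.
     */
    take_new_snapshot();        /* hypothetical helper */
    copy_table(relid);          /* hypothetical: COPY + catchup for relid */
}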

>
> 1.
> IIUC with patch 0001, each/every tablesync worker (a.k.a. TSW) when it
> finishes dealing with one table then goes looking to find if there is
> some relation that it can process next. So now every TSW has a loop
> where it will fight with every other available TSW over who will get
> to process the next relation.
>
> Somehow this seems all backwards to me. Isn't it strange for the TSW
> to be the one deciding what relation it would deal with next?
>
> IMO it seems more natural to simply return the finished TSW to some
> kind of "pool" of available workers and the main Apply process can
> just grab a TSW from that pool instead of launching a brand new one in
> the existing function process_syncing_tables_for_apply(). Or, maybe
> those "available" workers can be returned to a pool maintained within
> the launcher.c code, which logicalrep_worker_launch() can draw from
> instead of launching a whole new process?
>
> (I need to read the earlier posts to see if these options were already
> discussed and rejected)
>

I think, as noted in [3], relying on a single apply worker to assign
tables to several tablesync workers might bring some overhead: it's
possible that some tablesync workers sit idle until the apply worker
assigns them something. OTOH yes, the current approach makes tablesync
workers race for the next table to sync.
TBF both ways might be worth discussing/investigating more before
deciding which way to go.
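
To make the "race" concrete, here is a toy standalone model (plain C
with pthreads; real tablesync workers are separate processes
coordinating through the catalog and locks, so this is only an
analogy). Each worker atomically claims the next unsynced table under a
mutex and keeps going until nothing is left, which is roughly what the
racing scheme amounts to:

#include <pthread.h>
#include <stdio.h>

#define NTABLES  100
#define NWORKERS 4

static int next_table = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Atomically claim the next table to sync, or -1 if none remain. */
static int
claim_next_table(void)
{
    int         t = -1;

    pthread_mutex_lock(&lock);
    if (next_table < NTABLES)
        t = next_table++;
    pthread_mutex_unlock(&lock);
    return t;
}

static void *
tablesync_worker(void *arg)
{
    int         id = *(int *) arg;
    int         t;

    /* A reused worker keeps claiming tables until none are left. */
    while ((t = claim_next_table()) >= 0)
        printf("worker %d syncs table %d\n", id, t);
    return NULL;
}

int
main(void)
{
    pthread_t   th[NWORKERS];
    int         ids[NWORKERS];

    for (int i = 0; i < NWORKERS; i++)
    {
        ids[i] = i;
        pthread_create(&th[i], NULL, tablesync_worker, &ids[i]);
    }
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(th[i], NULL);
    return 0;
}

The pool alternative would instead have the apply worker (or the
launcher) push assignments to idle workers; the difference is mostly
who drives the handoff, not the total amount of work.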

> 2.
> AFAIK the thing that identifies a tablesync worker is the fact that
> only TSW will have a 'relid'.
>
> But it feels very awkward to me to have a TSW marked as "available"
> and yet that LogicalRepWorker must still have some OLD relid field
> value lurking (otherwise it will forget that it is a "tablesync"
> worker!).
>
> IMO perhaps it is time now to introduce some enum 'type' to the
> LogicalRepWorker. Then an "in_use" type=TSW would have a valid 'relid'
> whereas an "available" type=TSW would have relid == InvalidOid.
>

Hmm, relid will be updated immediately when the worker moves to a new
table. And the time between finishing the sync of one table and finding
the next table to sync should be minimal. I'm not sure how carrying a
stale relid for such a short time can do any harm.
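
For reference, IIUC the suggestion would look roughly like this (my
hypothetical sketch, not code from the patch):

/* Hypothetical sketch of the suggested worker type, not patch code. */
typedef enum LogicalRepWorkerType
{
    WORKERTYPE_APPLY,
    WORKERTYPE_TABLESYNC
} LogicalRepWorkerType;

/*
 * A tablesync worker would then be identified by its type:
 *   type == WORKERTYPE_TABLESYNC && OidIsValid(relid)    -> in use
 *   type == WORKERTYPE_TABLESYNC && relid == InvalidOid  -> available
 */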

> 3.
> Maybe I am mistaken, but it seems the benchmark results posted are
> only using quite a small/default values for
> "max_sync_workers_per_subscription", so I wondered how those results
> are affected by increasing that GUC. I think having only very few
> workers would cause more sequential processing, so conveniently the
> effect of the patch avoiding re-launch might be seen in the best
> possible light. OTOH, using more TSW in the first place might reduce
> the overall tablesync time because the subscriber can do more work in
> parallel.

> So I'm not quite sure what the goal is here. E.g. if the user doesn't
> care much about how long tablesync phase takes then there is maybe no
> need for this patch at all. OTOH, I thought if a user does care about
> the subscription startup time, won't those users be opting for a much
> larger "max_sync_workers_per_subscription" in the first place?
> Therefore shouldn't the benchmarking be using a larger number too?

Regardless of how many tablesync workers there are, reusing workers will
speed things up whenever a worker gets the chance to sync more than one
table. Increasing the number of tablesync workers does, of course,
improve tablesync performance. But if the sync is not 100% parallel
(i.e. the number of sync workers is smaller than the number of tables to
sync), then reusing workers brings an additional improvement.
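
For example, with 100 tables and 8 workers the sync cannot be fully
parallel: each worker ends up handling roughly 12-13 tables, so reuse
saves about 100 - 8 = 92 worker launches (and as many replication slot
creations) compared to launching a fresh worker per table.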

Here are some benchmarks similar to earlier, but with 100 tables and
different number of workers:

+--------+-------------+-------------+-------------+------------+
| | 2 workers | 4 workers | 6 workers | 8 workers |
+--------+-------------+-------------+-------------+------------+
| master | 2579.154 ms | 1383.153 ms | 1001.559 ms | 911.758 ms |
+--------+-------------+-------------+-------------+------------+
| patch | 1724.230 ms | 853.894 ms | 601.176 ms | 496.395 ms |
+--------+-------------+-------------+-------------+------------+

So yes, increasing the number of workers makes things faster. But
reusing workers still brings a further improvement on top of that.

[1]
https://www.postgresql.org/message-id/CAAKRu_YKGyF%2BsvRQqe1th-mG9xLdzneWgh9H1z1DtypBkawkkw%40mail.gmail.com
[2]
https://www.postgresql.org/message-id/CAGPVpCRWEVhXa7ovrhuSQofx4to7o22oU9iKtrOgAOtz_%3DY6vg%40mail.gmail.com
[3]
https://www.postgresql.org/message-id/CAGPVpCRzD-ZZEc9ienhyrVpCzd9AJ7fxE--OFFJBnBg3E0438w%40mail.gmail.com

Best,
--
Melih Mutlu
Microsoft
