Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

From: vignesh C <vignesh21(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, Melih Mutlu <m(dot)melihmutlu(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, "Wei Wang (Fujitsu)" <wangw(dot)fnst(at)fujitsu(dot)com>, "Yu Shi (Fujitsu)" <shiy(dot)fnst(at)fujitsu(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Date: 2023-08-14 11:59:30
Message-ID: CALDaNm3HogBpb3O00kF3HG1xNm=m6pMxtmV4Bu=Gq56icOofdw@mail.gmail.com
Lists: pgsql-hackers

On Thu, 10 Aug 2023 at 10:16, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Wed, Aug 9, 2023 at 8:28 AM Zhijie Hou (Fujitsu)
> <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> >
> > On Thursday, August 3, 2023 7:30 PM Melih Mutlu <m(dot)melihmutlu(at)gmail(dot)com> wrote:
> >
> > > Right. I attached the v26 as you asked.
> >
> > Thanks for posting the patches.
> >
> > While reviewing the patch, I noticed one rare case where it's possible that
> > there are two table sync workers for the same table at the same time.
> >
> > The patch relies on LogicalRepWorkerLock to prevent concurrent access, but the
> > apply worker will start a new worker after releasing the lock. So, at the point[1]
> > where the lock is released and the new table sync worker has not been started,
> > it seems possible that another old table sync worker will be reused for the
> > same table.
> >
> > /* Now safe to release the LWLock */
> > LWLockRelease(LogicalRepWorkerLock);
> > *[1]
> > /*
> > * If there are free sync worker slot(s), start a new sync
> > * worker for the table.
> > */
> > if (nsyncworkers < max_sync_workers_per_subscription)
> > ...
> > logicalrep_worker_launch(MyLogicalRepWorker->dbid,
> >
>
> Yeah, this is a problem. I think one idea to solve this is by
> extending the lock duration till we launch the tablesync worker but we
> should also consider changing this locking scheme such that there is a
> better way to indicate that for a particular rel, tablesync is in
> progress. Currently, the code in TablesyncWorkerMain() also acquires
> the lock in exclusive mode even though the tablesync for a rel is in
> progress, which I guess could easily hurt us for larger values of
> max_logical_replication_workers. So, that could be another motivation
> to think for a different locking scheme.

There are a couple of ways in which this issue can be solved:
Approach #1) In logicalrep_worker_launch, while holding
LogicalRepWorkerLock, check whether a reused worker has already picked
up this table for table sync; if it has, simply ignore the request and
return, since the launcher has nothing left to do in this case (see
the sketch below).
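A minimal sketch of how that check could look inside
logicalrep_worker_launch() while the lock is held. The placement and
the bool return convention are assumptions on my part;
logicalrep_worker_find() is the existing helper for looking up a
worker by subid/relid:

    LWLockAcquire(LogicalRepWorkerLock, LW_EXCLUSIVE);

    /*
     * Sketch only: if some (possibly reused) tablesync worker has
     * already claimed this relation since the caller released the
     * lock, the launcher has nothing left to do for it.
     */
    if (OidIsValid(relid) &&
        logicalrep_worker_find(subid, relid, false) != NULL)
    {
        LWLockRelease(LogicalRepWorkerLock);
        return false;       /* assumed return convention */
    }

    /* ... existing logic: find a free slot and register the worker ... */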
Approach #2)
a) The apply worker creates a shared memory area holding all the
relations that need to be synced.
b) A tablesync worker takes a lock on this shared memory and picks the
next table to be processed, so tablesync workers need not fetch the
subscription relations again and again.
c) The tablesync worker updates the status of the relation in shared
memory (since the lock is held, there are no concurrency issues) and
also records the start time there; this helps avoid restarting a
failed table before wal_retrieve_retry_interval has expired.
d) The tablesync worker syncs the table.
e) The subscription relation is marked as ready and the tablesync
worker removes the entry from shared memory.
f) The apply worker periodically synchronizes the shared memory
entries with the fetched subscription relation tables.
g) When the apply worker exits, the shared memory is cleared.
A sketch of a possible layout follows this list.
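To make Approach #2 concrete, here is a hypothetical sketch of the
shared memory layout and of how a tablesync worker could claim the
next table (steps b and c above); all names in it (SyncTableShared,
pick_next_table, etc.) are illustrative, nothing here exists in the
tree yet:

    #include "postgres.h"

    #include "storage/lwlock.h"
    #include "utils/timestamp.h"

    extern int  wal_retrieve_retry_interval;    /* existing GUC, in ms */

    typedef enum SyncEntryState
    {
        SYNC_ENTRY_PENDING,         /* waiting for a tablesync worker */
        SYNC_ENTRY_IN_PROGRESS      /* claimed by a tablesync worker */
    } SyncEntryState;

    typedef struct SyncTableEntry
    {
        Oid         relid;          /* relation to be synchronized */
        SyncEntryState state;
        TimestampTz last_start_time;    /* gates retries of failed tables */
    } SyncTableEntry;

    /* Populated and kept up to date by the apply worker (steps a and f). */
    typedef struct SyncTableShared
    {
        LWLock      lock;           /* protects everything below */
        int         nentries;
        SyncTableEntry entries[FLEXIBLE_ARRAY_MEMBER];
    } SyncTableShared;

    /*
     * Steps b and c: a tablesync worker claims the next pending table,
     * skipping entries whose last run started less than
     * wal_retrieve_retry_interval ago.  Returns InvalidOid if there is
     * currently nothing to pick.
     */
    static Oid
    pick_next_table(SyncTableShared *shared)
    {
        Oid         relid = InvalidOid;
        TimestampTz now = GetCurrentTimestamp();

        LWLockAcquire(&shared->lock, LW_EXCLUSIVE);
        for (int i = 0; i < shared->nentries; i++)
        {
            SyncTableEntry *e = &shared->entries[i];

            if (e->state == SYNC_ENTRY_PENDING &&
                TimestampDifferenceExceeds(e->last_start_time, now,
                                           wal_retrieve_retry_interval))
            {
                e->state = SYNC_ENTRY_IN_PROGRESS;
                e->last_start_time = now;
                relid = e->relid;
                break;
            }
        }
        LWLockRelease(&shared->lock);

        return relid;
    }

Since the lock is held across the scan-and-claim, no two tablesync
workers can pick the same relation, which closes the race described
above without the launcher having to hold LogicalRepWorkerLock any
longer.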

Approach #2 would also help in solving the other issue reported by
Amit at [1]. I feel we can use Approach #2 to solve the problem, as it
fixes both of the reported issues and has the added advantage that a
reused tablesync worker need not scan pg_subscription_rel for the
non-ready tables on every run; it can instead use the list prepared by
the apply worker.
Thoughts?

[1] - https://www.postgresql.org/message-id/CAA4eK1KyHfVOVeio28p8CHDnuyKuej78cj_7U9mHAB4ictVQwQ%40mail.gmail.com

Regards,
Vignesh
