Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: Melih Mutlu <m(dot)melihmutlu(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, "Wei Wang (Fujitsu)" <wangw(dot)fnst(at)fujitsu(dot)com>, "Yu Shi (Fujitsu)" <shiy(dot)fnst(at)fujitsu(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Date: 2023-08-14 10:07:43
Message-ID: CAA4eK1KyHfVOVeio28p8CHDnuyKuej78cj_7U9mHAB4ictVQwQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Aug 10, 2023 at 10:15 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Wed, Aug 9, 2023 at 8:28 AM Zhijie Hou (Fujitsu)
> <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> >
> > On Thursday, August 3, 2023 7:30 PM Melih Mutlu <m(dot)melihmutlu(at)gmail(dot)com> wrote:
> >
> > > Right. I attached the v26 as you asked.
> >
> > Thanks for posting the patches.
> >
> > While reviewing the patch, I noticed one rare case that it's possible that there
> > are two table sync worker for the same table in the same time.
> >
> > The patch relies on LogicalRepWorkerLock to prevent concurrent access, but the
> > apply worker will start a new worker after releasing the lock. So, at the point[1]
> > where the lock is released and the new table sync worker has not been started,
> > it seems possible that another old table sync worker will be reused for the
> > same table.
> >
> > /* Now safe to release the LWLock */
> > LWLockRelease(LogicalRepWorkerLock);
> > *[1]
> > /*
> > * If there are free sync worker slot(s), start a new sync
> > * worker for the table.
> > */
> > if (nsyncworkers < max_sync_workers_per_subscription)
> > ...
> > logicalrep_worker_launch(MyLogicalRepWorker->dbid,
> >
>
> Yeah, this is a problem. I think one idea to solve this is by
> extending the lock duration till we launch the tablesync worker but we
> should also consider changing this locking scheme such that there is a
> better way to indicate that for a particular rel, tablesync is in
> progress. Currently, the code in TablesyncWorkerMain() also acquires
> the lock in exclusive mode even though the tablesync for a rel is in
> progress which I guess could easily heart us for larger values of
> max_logical_replication_workers. So, that could be another motivation
> to think for a different locking scheme.
>

Yet another problem is that currently apply worker maintains a hash
table for 'last_start_times' to avoid restarting the tablesync worker
immediately upon error. The same functionality is missing while
reusing the table sync worker. One possibility is to use a shared hash
table to remember start times but I think it may depend on what we
decide to solve the previous problem reported by Hou-San.

--
With Regards,
Amit Kapila.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Fabien COELHO 2023-08-14 10:09:52 Re: pgbench with libevent?
Previous Message shveta malik 2023-08-14 09:52:24 Re: Synchronizing slots from primary to standby