RE: Deadlock between logrep apply worker and tablesync worker

From: "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: vignesh C <vignesh21(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RE: Deadlock between logrep apply worker and tablesync worker
Date: 2023-02-03 07:57:57
Message-ID: OS3PR01MB5718B099C8DA6F91FDA26EEF94D79@OS3PR01MB5718.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thursday, February 2, 2023 7:21 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Thu, Feb 2, 2023 at 12:05 PM houzj(dot)fnst(at)fujitsu(dot)com
> <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> >
> > On Tuesday, January 31, 2023 1:07 AM vignesh C <vignesh21(at)gmail(dot)com>
> wrote:
> > > On Mon, 30 Jan 2023 at 17:30, vignesh C <vignesh21(at)gmail(dot)com> wrote:
> > >
> >
> > I also tried to test the time of "src/test/subscription/t/002_types.pl"
> > before and after the patch(change the lock level) and Tom's
> > patch(split
> > transaction) like what Vignesh has shared on -hackers.
> >
> > I run about 100 times for each case. Tom's and the lock level patch
> > behave similarly on my machines[1].
> >
> > HEAD: 3426 ~ 6425 ms
> > HEAD + Tom: 3404 ~ 3462 ms
> > HEAD + Vignesh: 3419 ~ 3474 ms
> > HEAD + Tom + Vignesh: 3408 ~ 3454 ms
> >
> > Even apart from the testing time reduction, reducing the lock level
> > and lock the specific object can also help improve the lock contention
> > which user(that use the exposed function) , table sync worker and
> > apply worker can also benefit from it. So, I think pushing the patch to change
> the lock level makes sense.
> >
> > And the patch looks good to me.
> >
>
> Thanks for the tests. I also see a reduction in test time variability with Vignesh's
> patch. I think we can release the locks in case the origin is concurrently
> dropped as in the attached patch. I am planning to commit this patch
> tomorrow unless there are more comments or objections.
>
> > While on it, after pushing the patch, I think there is another case
> > might also worth to be improved, that is the table sync and apply
> > worker try to drop the same origin which might cause some delay. This
> > is another case(different from the deadlock), so I feel we can try to improve
> this in another patch.
> >
>
> Right, I think that case could be addressed by Tom's patch to some extent but
> I am thinking we should also try to analyze if we can completely avoid the need
> to remove origins from both processes. One idea could be to introduce
> another relstate something like PRE_SYNCDONE and set it in a separate
> transaction before we set the state as SYNCDONE and remove the slot and
> origin in tablesync worker.
> Now, if the tablesync worker errors out due to some reason during the second
> transaction, it can remove the slot and origin after restart by checking the state.
> However, it would add another relstate which may not be the best way to
> address this problem. Anyway, that can be accomplished as a separate patch.

Here is an attempt to achieve the same.
Basically, the patch removes the code that drop the origin in apply worker. And
add a new state PRE_SYNCDONE after synchronization finished in front of apply
(sublsn set), but before dropping the origin and other final cleanups. The
tablesync will restart and redo the cleanup if it failed after reaching the new
state. Besides, since the changes can already be applied on the table in
PRE_SYNCDONE state, so I also modified the check in
should_apply_changes_for_rel(). And some other conditions for the origin drop
in subscription commands are were adjusted in this patch.

Best Regards,
Hou zj

Attachment Content-Type Size
0001-Avoid-dropping-origins-from-both-apply-and-tablesync.patch application/octet-stream 15.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2023-02-03 07:57:59 Re: Perform streaming logical transactions by background workers and parallel apply
Previous Message Tom Lane 2023-02-03 07:50:38 Re: Weird failure with latches in curculio on v15