Re: Deadlock between logrep apply worker and tablesync worker

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: vignesh C <vignesh21(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Deadlock between logrep apply worker and tablesync worker
Date: 2023-02-02 11:21:14
Message-ID: CAA4eK1KXDanJqxC_qrUqxEu1VbvySBn8qcCK+2zM0CAb+g2boQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Feb 2, 2023 at 12:05 PM houzj(dot)fnst(at)fujitsu(dot)com
<houzj(dot)fnst(at)fujitsu(dot)com> wrote:
>
> On Tuesday, January 31, 2023 1:07 AM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> > On Mon, 30 Jan 2023 at 17:30, vignesh C <vignesh21(at)gmail(dot)com> wrote:
> >
>
> I also tried to test the time of "src/test/subscription/t/002_types.pl"
> before and after the patch(change the lock level) and Tom's patch(split
> transaction) like what Vignesh has shared on -hackers.
>
> I run about 100 times for each case. Tom's and the lock level patch
> behave similarly on my machines[1].
>
> HEAD: 3426 ~ 6425 ms
> HEAD + Tom: 3404 ~ 3462 ms
> HEAD + Vignesh: 3419 ~ 3474 ms
> HEAD + Tom + Vignesh: 3408 ~ 3454 ms
>
> Even apart from the testing time reduction, reducing the lock level and lock
> the specific object can also help improve the lock contention which user(that
> use the exposed function) , table sync worker and apply worker can also benefit
> from it. So, I think pushing the patch to change the lock level makes sense.
>
> And the patch looks good to me.
>

Thanks for the tests. I also see a reduction in test time variability
with Vignesh's patch. I think we can release the locks in case the
origin is concurrently dropped as in the attached patch. I am planning
to commit this patch tomorrow unless there are more comments or
objections.

> While on it, after pushing the patch, I think there is another case might also
> worth to be improved, that is the table sync and apply worker try to drop the
> same origin which might cause some delay. This is another case(different from
> the deadlock), so I feel we can try to improve this in another patch.
>

Right, I think that case could be addressed by Tom's patch to some
extent but I am thinking we should also try to analyze if we can
completely avoid the need to remove origins from both processes. One
idea could be to introduce another relstate something like
PRE_SYNCDONE and set it in a separate transaction before we set the
state as SYNCDONE and remove the slot and origin in tablesync worker.
Now, if the tablesync worker errors out due to some reason during the
second transaction, it can remove the slot and origin after restart by
checking the state. However, it would add another relstate which may
not be the best way to address this problem. Anyway, that can be
accomplished as a separate patch.

--
With Regards,
Amit Kapila.

Attachment Content-Type Size
v4-0001-Optimize-the-origin-drop-functionality.patch application/octet-stream 3.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message shveta malik 2023-02-02 11:31:09 Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Previous Message Jakub Wartak 2023-02-02 11:12:44 Re: Syncrep and improving latency due to WAL throttling