Re: Logrep launcher race conditions leading to slow tests

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Logrep launcher race conditions leading to slow tests
Date: 2025-06-25 05:33:39
Message-ID: CAFiTN-sDAYW6Ygq1_LNkkaXGFo9w-8jxnQCUtFq99Le5-xHLzg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jun 24, 2025 at 9:53 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> writes:
> > On Tue, Jun 24, 2025 at 5:26 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> 1. WaitForReplicationWorkerAttach sometimes has to clear a process
> >> latch event so that it can keep waiting for the worker to launch.
> >> It neglects to set the latch again, allowing ApplyLauncherMain
> >> to miss events.
>
> > There was a previous discussion to fix this behavior. Heikki has
> > proposed a similar fix for this, but at the caller. See the patch
> > attached in email [1].
>
> Ah, thank you for that link. I vaguely recalled that we'd discussed
> this strange behavior before, but I did not realize that anyone had
> diagnosed a cause. I don't much like any of the patches proposed
> in that thread though --- they seem overcomplicated or off-point.
>
> I do not think we can switch to having two latches here. The
> only reason WaitForReplicationWorkerAttach needs to pay attention
> to the process latch at all is that it needs to service
> CHECK_FOR_INTERRUPTS() conditions in a timely fashion, which is
> also true in the ApplyLauncherMain loop. We can't realistically
> expect other processes to signal different latches depending on
> where the launcher is waiting, so those cases have to be triggered
> by the same latch.
>
> However, WaitForReplicationWorkerAttach can't service any
> latch-setting conditions other than those managed by
> CHECK_FOR_INTERRUPTS(). So if CHECK_FOR_INTERRUPTS() returns,
> we have some other triggering condition, which is the business
> of some outer code level to deal with. Having it re-set the latch
> to allow that to happen promptly after it returns seems like a
> pretty straightforward answer to me.

Yeah that makes sense, we can not simply wait on another latch which
is not managed by CHECK_FOR_INTERRUPTS(), and IMHO the proposed patch
looks simple for resolving this issue.

--
Regards,
Dilip Kumar
Google

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message jian he 2025-06-25 05:56:51 Re: SQL:2023 JSON simplified accessor support
Previous Message Zhijie Hou (Fujitsu) 2025-06-25 05:27:27 RE: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly