RE: Improving the latch handling between logical replication launcher and worker processes.

From:	"Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To:	vignesh C <vignesh21(at)gmail(dot)com>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	RE: Improving the latch handling between logical replication launcher and worker processes.
Date:	2024-04-26 07:52:51
Message-ID:	OS0PR01MB57165EB656A8AB92E4BE650694162@OS0PR01MB5716.jpnprd01.prod.outlook.com
Lists:	pgsql-hackers

On Thursday, April 25, 2024 4:59 PM vignesh C <vignesh21(at)gmail(dot)com> wrote:
>
> Hi,
>
> Currently the launcher's latch is used for the following: a) worker process attach
> b) worker process exit and c) subscription creation.
> Since this same latch is used for multiple cases, the launcher process is not able
> to handle concurrent scenarios like: a) Launcher started a new apply worker and
> waiting for apply worker to attach and b) create subscription sub2 sending
> launcher wake up signal. In this scenario, both of them will set latch of the
> launcher process, the launcher process is not able to identify that both
> operations have occurred 1) worker is attached 2) subscription is created and
> apply worker should be started. As a result the apply worker does not get
> started for the new subscription created immediately and gets started after the
> timeout of 180 seconds.
>
> I have started a new thread for this based on suggestions at [1].
>
> a) Introduce a new latch to handle worker attach and exit.

I found that the startup process also uses two latches (see recoveryWakeupLatch)
for different purposes, so maybe this is OK. But note that both the logical
launcher and the apply worker call logicalrep_worker_launch(). If we add only
one new latch, both processes will wait on the same new latch, and the launcher
may consume a SetLatch() that should have been consumed by the apply worker
(although this is rare).
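
To illustrate the concern, here is a minimal toy model in Python. It is not PostgreSQL code: ToyLatch, wake_and_check() and simulate_shared_latch() are hypothetical names, and the only assumption taken from latch behavior is that a latch is a single flag that carries neither a count nor the identity of whoever set it.

```python
class ToyLatch:
    """Toy stand-in for a latch: one boolean, no counter, no sender identity."""

    def __init__(self):
        self.is_set = False

    def set(self):
        # Models SetLatch(): setting an already-set latch changes nothing.
        self.is_set = True

    def reset(self):
        # Models ResetLatch(): the pending wakeup is consumed.
        self.is_set = False


def wake_and_check(latch, condition):
    """One pass of a wait loop for a single waiter.

    Returns "asleep" if the latch is not set, otherwise consumes the
    latch and reports whether this waiter's own condition holds.
    """
    if not latch.is_set:
        return "asleep"
    latch.reset()
    return "satisfied" if condition() else "spurious"


def simulate_shared_latch():
    """Launcher and apply worker both wait on one shared latch."""
    shared = ToyLatch()
    attached = {"launcher_child": False, "apply_child": False}

    # The worker started by the apply worker attaches and sets the latch.
    attached["apply_child"] = True
    shared.set()

    # The launcher happens to run first and consumes the wakeup.
    launcher = wake_and_check(shared, lambda: attached["launcher_child"])
    # The apply worker now finds the latch clear, although its condition is true.
    apply_worker = wake_and_check(shared, lambda: attached["apply_child"])
    return launcher, apply_worker
```

In this model the launcher sees a spurious wakeup and the apply worker stays asleep until its wait times out, which is the rare case described above.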

> b) Add a new GUC launcher_retry_time which gives more flexibility to users as
> suggested by Amit at [1]. Before 5a3a953, the wal_retrieve_retry_interval plays
> a similar role as the suggested new GUC launcher_retry_time, e.g. even if a
> worker is launched, the launcher only wait wal_retrieve_retry_interval time
> before next round.

IIUC, the issue does not happen frequently and may not be noticeable in setups
where apply workers are not restarted often. So I am not sure it is worth
adding a new GUC.

> c) Don't reset the latch at worker attach and allow launcher main to identify and
> handle it. For this there is a patch v6-0002 available at [2].

This seems simple. I found we are doing something similar in
RegisterSyncRequest() and WalSummarizerMain().
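
As a rough sketch of why leaving the latch untouched helps, the toy model below compares an attach wait that resets the latch with one that does not. It is only my reading of the idea, not the v6-0002 patch: ToyLatch, wait_for_attach() and simulate_attach_wait() are hypothetical names, and the polling loop is an assumption made for the illustration.

```python
class ToyLatch:
    """Toy stand-in for a latch: a single boolean flag."""

    def __init__(self):
        self.is_set = False

    def set(self):
        self.is_set = True

    def reset(self):
        self.is_set = False


def wait_for_attach(latch, attach_states, reset_latch):
    """Poll until the worker reports attached.

    attach_states is the sequence of values the attach check would return
    on successive polls. When reset_latch is true, each unsuccessful poll
    clears the latch, as a wait loop that resets the latch would.
    """
    for attached in attach_states:
        if attached:
            return True
        if reset_latch:
            latch.reset()
    return False


def simulate_attach_wait(reset_latch):
    """Return True if a wakeup set during the attach wait is still pending."""
    latch = ToyLatch()
    # A subscription is created while the launcher waits for the attach.
    latch.set()
    # The worker is not attached on the first poll, attached on the second.
    wait_for_attach(latch, [False, True], reset_latch)
    # Back in the main loop: is there still a wakeup to act on?
    return latch.is_set
```

With reset_latch=True the wakeup is gone by the time the main loop runs, so in this model the launcher would sleep until its timeout. With reset_latch=False the latch is still set and the main loop can reset it and recheck every condition.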

Best Regards,
Hou zj

In response to

Improving the latch handling between logical replication launcher and worker processes. at 2024-04-25 08:58:40 from vignesh C
