logicalrep_worker_launch -- counting/checking the worker limits

From: Peter Smith <smithpb2250(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: logicalrep_worker_launch -- counting/checking the worker limits
Date: 2023-08-11 08:58:35
Message-ID: CAHut+Pu-7DiCwMdbBf5vzUL1t8H=_NpyFR4RvpZbpMGXV=+QLQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

While reviewing other threads I have been looking more closely at the
the logicalrep_worker_launch() function. IMO the logic of that
function seems not quite right.

Here are a few things I felt are strange:

1. The function knows exactly what type of worker it is launching, but
still, it is calling the worker counting functions
logicalrep_sync_worker_count() and logicalrep_pa_worker_count() even
when launching a worker of a *different* type.

1a. I think should only count/check the tablesync worker limit when
trying to launch a tablesync worker

1b. I think should only count/check the parallel apply worker limit
when trying to launch a parallel apply worker

~

2. There is some condition for attempting the garbage-collection of workers:

/*
* If we didn't find a free slot, try to do garbage collection. The
* reason we do this is because if some worker failed to start up and its
* parent has crashed while waiting, the in_use state was never cleared.
*/
if (worker == NULL || nsyncworkers >= max_sync_workers_per_subscription)

The inclusion of that nsyncworkers check here has very subtle
importance. AFAICT this means that even if we *did* find a free
worker, we still need to do garbage collection just in case one of
those 'in_use' tablesync worker was in error (e.g. crashed after
marked in_use). By garbage-collecting (and then re-counting
nsyncworkers) we might be able to launch the tablesync successfully
instead of just returning that it has maxed out.

2a. IIUC that is all fine. The problem is that I think there should be
exactly the same logic for the parallel apply workers here also.

2b. The comment above should explain more about the reason for the
nsyncworkers condition -- the existing comment doesn't really cover
it.

~

3. There is a wrong cut/paste comment in the body of
logicalrep_sync_worker_count().

That comment should be changed to read similarly to the equivalent
comment in logicalrep_pa_worker_count.

------

PSA a patch to address all these items.

This patch is about making the function logically consistent. Removing
some of the redundant countings should also be more efficient in
theory, but in practice, I think the unnecessary worker loops are too
short (max_logical_replication_workers) for any performance
improvements to be noticeable.

Thoughts?

------
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment Content-Type Size
v1-0001-logicalrep_worker_launch-limit-checks.patch application/octet-stream 3.5 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2023-08-11 09:33:44 Re: Adding a LogicalRepWorker type field
Previous Message Pavel Stehule 2023-08-11 06:34:09 Re: proposal: psql: show current user in prompt