Re: Performance issues with parallelism and LIMIT

From: David Geier <geidav(dot)pg(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, dilipbalaut(at)gmail(dot)com
Subject: Re: Performance issues with parallelism and LIMIT
Date: 2025-09-02 11:38:17
Message-ID: 15351bdf-e25c-4eb6-b1d9-b95592c6df10@gmail.com
Lists: pgsql-hackers

Hi Tomas!

I've finally got time again to work on PostgreSQL.

On 03.11.2023 21:48, Tomas Vondra wrote:
> On 2/22/23 13:22, Tomas Vondra wrote:
>> ...
>>
>>>> No opinion on these options, but worth a try. Alternatively, we could
>>>> try the usual doubling approach - start with a low threshold (and set
>>>> the latch frequently), and then gradually increase it up to the 1/4.
>>>>
>>>> That should work both for queries expecting only few rows and those
>>>> producing a lot of data.
>>> I was thinking about this variant as well. One more alternative would be
>>> latching the leader once a worker has produced 1/Nth of the LIMIT where
>>> N is the number of workers. Both variants have the disadvantage that
>>> there are still corner cases where the latch is set too late; but it
>>> would for sure be much better than what we have today.

Or we could always set the latch when a LIMIT is present: in that case
it's much more likely that the added latency hurts than that it doesn't.
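
For illustration, that could boil down to extending the existing flush
condition in shm_mq_sendv(). The sketch below is paraphrased from memory;
the mqh_flush_immediately field and its setter are made up, only the
force_flush/1/4th check exists today:

    /* Hypothetical: set during Gather/tuple queue setup when the plan
     * contains a LIMIT, so every queued tuple wakes up the leader. */
    void
    shm_mq_set_flush_immediately(shm_mq_handle *mqh, bool enable)
    {
        mqh->mqh_flush_immediately = enable;    /* hypothetical field */
    }

    /* In shm_mq_sendv(), extend the flush condition accordingly: */
    if (force_flush ||
        mqh->mqh_flush_immediately ||           /* hypothetical */
        mqh->mqh_send_pending > (mq->mq_ring_size >> 2))
    {
        shm_mq_inc_bytes_written(mq, mqh->mqh_send_pending);
        SetLatch(&mq->mq_receiver->procLatch);
        mqh->mqh_send_pending = 0;
    }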

>>> I also did some profiling and - at least on my development laptop with 8
>>> physical cores - the original example, motivating the batching change is
>>> slower than when it's disabled by commenting out:
>>>
>>>     if (force_flush || mqh->mqh_send_pending > (mq->mq_ring_size >> 2))
>>>
>>> SET parallel_tuple_cost TO 0;
>>> CREATE TABLE b (a int);
>>> INSERT INTO b SELECT generate_series(1, 200000000);
>>> ANALYZE b;
>>> EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM b;
>>>
>>>  Gather  (cost=1000.00..1200284.61 rows=200375424 width=4) (actual
>>> rows=200000000 loops=1)
>>>    Workers Planned: 7
>>>    Workers Launched: 7
>>>    ->  Parallel Seq Scan on b  (cost=0.00..1199284.61 rows=28625061
>>> width=4) (actual rows=25000000 loops=8)
>>>
>>> Always latch: 19055 ms
>>> Batching:     19575 ms
>>>
>>> If I find some time, I'll play around a bit more and maybe propose a patch.

I've also remeasured the shared memory latching with and without the
1/4th check, using the original example from [1]. Apart from the code
line you mentioned, I also commented out the corresponding check on the
consumer side:

if (mqh->mqh_consume_pending > mq->mq_ring_size / 4)

On my dev laptop (i9-13950HX) the runtimes with 8 workers are pretty
much the same (16-17 seconds, with some variance). It would be great to
understand when this check truly helps, if at all, to see whether we
need some smarter logic for when to set the consumer's latch or whether
we can just remove the 1/4th check.

If this turns out to be more involved, we could also move this
discussion to a separate thread and keep this thread focused on stopping
the parallel workers early (see below).

>>
>> OK. Once you have a WIP patch maybe share it and I'll try to do some
>> profiling too.
>>
>>>>> ...
>>>>>
>>>>> We would need something similar to CHECK_FOR_INTERRUPTS() which returns
>>>>> a NULL slot if a parallel worker is supposed to stop execution (we could
>>>>> e.g. check if the queue got detached). Or could we amend
>>>>> CHECK_FOR_INTERRUPTS() to just stop the worker gracefully if the queue
>>>>> got detached?
>>>>>
>>>> That sounds reasonable, but I'm not very familiar the leader-worker
>>>> communication, so no opinion on how it should be done.
>>>
>>> I think an extra macro that needs to be called from dozens of places to
>>> check if parallel execution is supposed to end is the least preferred
>>> approach. I'll read up more on how CHECK_FOR_INTERRUPTS() works and if
>>> we cannot actively signal the workers that they should stop.
>>>
>>
>> IMHO if this requires adding another macro to a bunch of ad hoc places
>> is rather inconvenient. It'd be much better to fix this in a localized
>> manner (especially as it seems related to a fairly specific place).

I've written up a draft patch that instructs the workers to stop once
the leader has received enough rows to satisfy the LIMIT clause. I'm
using SendProcSignal() to tell the workers to stop executing
ExecutePlan(), and the actual stopping is implemented via sigsetjmp().
I cannot see a good way of doing this differently that isn't much more
intrusive. The patch is still incomplete (comments, support for Gather
Merge, more testing, etc.), but at this point I'm mostly interested in
whether the overall approach is deemed acceptable.
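
To make the flow easier to discuss, here is a rough worker-side sketch.
All names (PROCSIG_STOP_PARALLEL_WORKER, stop_worker_pending, the two
functions, the jump buffer) are made up for illustration and don't
necessarily match what the attached patch does:

    /* Hypothetical flag and jump buffer, local to the worker. */
    static volatile sig_atomic_t stop_worker_pending = false;
    static sigjmp_buf stop_worker_jmp;

    /* Called from procsignal_sigusr1_handler() when the leader sent the
     * (hypothetical) PROCSIG_STOP_PARALLEL_WORKER reason: just set flags
     * and the latch, as other interrupt handlers do. */
    void
    HandleStopParallelWorkerInterrupt(void)
    {
        stop_worker_pending = true;
        InterruptPending = true;
        SetLatch(MyLatch);
    }

    /* Called from ProcessInterrupts(), i.e. on the next
     * CHECK_FOR_INTERRUPTS(): unwind out of ExecutePlan(). */
    void
    ProcessStopParallelWorkerInterrupt(void)
    {
        if (stop_worker_pending)
        {
            stop_worker_pending = false;
            siglongjmp(stop_worker_jmp, 1);
        }
    }

In ParallelQueryMain() a sigsetjmp() around the ExecutorRun() call would
catch the siglongjmp() and fall through to the usual ExecutorFinish()/
ExecutorEnd() shutdown, so instrumentation is still collected and sent
to the leader. On the leader side, the Gather node would loop over the
launched workers and SendProcSignal() the new reason once the LIMIT is
satisfied.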

I first tried TerminateBackgroundWorker(), but then postmaster.c logs
the worker termination, and some of the cleanup code needed for proper
instrumentation no longer runs in the parallel workers.

With the patch applied, the query from the first mail of this thread
runs in a few milliseconds. That it takes even that long is because
forking, plan (de-)serialization and the remaining initialization are
fairly heavyweight. With threads, the "fork" time would already be much
lower and no (de-)serialization would be necessary. In the process-based
architecture it would be interesting to think about adding a parallel
worker pool.

>
> David, do you still plan to try fixing these issues? I have a feeling
> those issues may be fairly common but often undetected, or just brushed
> of as "slow query" (AFAICS it was only caught thanks to comparing
> timings before/after upgrade). Would be great to improve this.

I completely agree. And while they look like corner cases, they will be
encountered if the workload is diverse enough (both findings are from
the field). If the affected query then runs frequently enough, it
becomes a real issue that is hard for the DBA to diagnose.

[1]
https://www.postgresql.org/message-id/flat/CAFiTN-tVXqn_OG7tHNeSkBbN%2BiiCZTiQ83uakax43y1sQb2OBA%40mail.gmail.com

--
David Geier

Attachment Content-Type Size
0001-Parallel-workers-stop-quicker.patch text/x-patch 8.0 KB
