| From: | David Geier <geidav(dot)pg(at)gmail(dot)com> |
|---|---|
| To: | Tomas Vondra <tomas(at)vondra(dot)me>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
| Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, dilipbalaut(at)gmail(dot)com |
| Subject: | Re: Performance issues with parallelism and LIMIT |
| Date: | 2025-11-18 12:07:38 |
| Message-ID: | 782fd9d0-99dc-4880-adcc-2df93557633d@gmail.com |
| Lists: | pgsql-hackers |
Hi Tomas!
On 14.11.2025 17:00, Tomas Vondra wrote:
> On 11/13/25 23:36, Tomas Vondra wrote:
>> ...
>>
>> What I think we should do is much simpler - make the threshold in shm_mq
>> dynamic, start with a very low value and gradually ramp up (up to 1/4).
>> So we'd have
>>
>> if (mqh->mqh_consume_pending > threshold)
>>
>> We might start with
>>
>> threshold = (mq->mq_ring_size / 1024)
>>
>> or maybe some fixed value, like
>>
>> threshold = 128
>>
>> And on every signal we'd double it, capping it to 1/4 of mq_ring_size.
>>
>> threshold = Min(threshold * 2, mq->mq_ring_size / 4);
>>
>> This is very similar to other places doing this gradual ramp up, like in
>> the prefetching / read_stream, etc. It allows fast termination for low
>> LIMIT values, but quickly amortizes the cost for high LIMIT values.
>>
>
> I gave this a try today, to see if it can actually solve the regression.
> Attached is a WIP patch, and a set of benchmarking scripts. On my ryzen
> machine I got this (timings of the queries):
>
> fill dataset | 14 15 16 17 18 patched
> -----------------------------------------------------------------
> 10 random | 64.1 319.3 328.7 340.5 344.3 79.5
> sequential | 54.6 323.4 347.5 350.5 399.2 78.3
> 100 random | 11.8 42.9 42.3 42.3 68.5 18.6
> sequential | 10.0 44.3 45.0 44.3 60.6 20.0
>
> Clearly 15 is a significant regression, with timings ~4x higher. And the
> patch improves that quite a bit. It's not down all the way back to 14,
> there's still ~10ms regression, for some reason.
>
> Also, I didn't measure if this patch causes some other regressions for
> other queries. I don't think it does, but maybe there's some weird
> corner case affected.
>
>
> regards
>
Thanks for working on that. This is certainly an improvement, but it
doesn't always work. You can still get into a situation where enough
data is already waiting in the queues to satisfy the LIMIT, but the
threshold hasn't been reached and never will be, because no more rows
will match. I'm especially passionate about that case, because currently
you can get arbitrarily bad query runtimes with big data sets and small
LIMITs.
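For reference, the quoted ramp-up logic could be sketched roughly like
this (standalone C; the function names and the exact initial value are
placeholders, not the actual shm_mq fields):

```c
#include <assert.h>
#include <stdint.h>

#define Min(a, b) ((a) < (b) ? (a) : (b))

/* Start with a small threshold so the receiver gets signalled quickly,
 * which lets queries with a small LIMIT terminate early. */
static uint64_t
threshold_init(uint64_t ring_size)
{
    uint64_t t = ring_size / 1024;
    return t > 0 ? t : 1;
}

/* Double the threshold on every signal, capped at 1/4 of the ring
 * size, amortizing the signalling overhead for large result sets. */
static uint64_t
threshold_advance(uint64_t threshold, uint64_t ring_size)
{
    return Min(threshold * 2, ring_size / 4);
}
```

The sender would then only signal once `mqh->mqh_consume_pending`
exceeds the current threshold, as in the quoted pseudo-code.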
As shared previously in this thread, I cannot reproduce any slowdown
when deactivating the late latching. The test used a very narrow row (a
single INT) and the data set fit into shared memory. I've only tried
with 8 parallel workers. Could you check whether you can reproduce the
slowdown, in case you have a machine with more cores at hand? If we can
somehow reproduce the original problem, I would also like to verify that
there aren't other issues at play that could be fixed differently (e.g.
false sharing).
If that optimization is truly necessary, how about always latching when
a LIMIT clause is present? Or, in the presence of a LIMIT clause,
keeping a count of all rows produced so far in shared memory and setting
the latch in all workers once the LIMIT has been reached? The overhead
of updating the shared atomic should be negligible for reasonable
LIMITs. Another alternative would be latching periodically: given that
the minimum runtime of any parallel query is a few dozen milliseconds
due to forking and plan (de-)serialization, we could live with latching
only, say, every millisecond or so.
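The shared row counter idea could look roughly like this (a sketch
using C11 atomics in place of PostgreSQL's pg_atomic_uint64 API; the
struct and function names are made up for illustration):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state placed in shared memory: each worker increments
 * rows_emitted for every tuple it produces; once the global sum
 * reaches the LIMIT, all workers see it and can stop / set latches. */
typedef struct
{
    atomic_uint_fast64_t rows_emitted;
    uint64_t             limit;
} shared_limit_state;

/* Returns true once the LIMIT has been reached across all workers. */
static bool
emit_row(shared_limit_state *st)
{
    /* atomic_fetch_add returns the value before the increment */
    uint64_t prev = atomic_fetch_add(&st->rows_emitted, 1);
    return prev + 1 >= st->limit;
}
```

For small LIMITs the counter is touched only a handful of times, so
contention on the shared cache line should not matter in practice.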
--
David Geier