Re: Performance issues with parallelism and LIMIT

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: David Geier <geidav(dot)pg(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, dilipbalaut(at)gmail(dot)com
Subject: Re: Performance issues with parallelism and LIMIT
Date: 2023-02-08 10:42:45
Message-ID: 49ce2f52-e80b-c2ff-35e7-1485f22c5441@enterprisedb.com
Lists: pgsql-hackers

On 2/1/23 14:41, David Geier wrote:
> Hi hackers,
>
> While migrating from PostgreSQL 14 to 15, we encountered the following
> performance degradation caused by commit 46846433a03dff ("shm_mq: Update
> mq_bytes_written less often"); see the discussion in [1].
>
> The batching can make queries with a LIMIT clause run significantly
> slower compared to PostgreSQL 14, because neither the ring buffer write
> position is updated nor the latch informing the leader that data is
> available is set before a worker's queue is 1/4 full. This shows up in
> the number of rows produced by a parallel worker. The worst case is a
> large data set in which all rows needed to answer the query appear
> early but are too small to fill the queue to 1/4 (e.g. when the LIMIT
> and the tuple sizes are small). Here is an example to reproduce the
> problem.
>

Yeah, this is a pretty annoying regression. We can already hit poor
behavior when matching rows are not distributed uniformly in the table
(which is what LIMIT costing assumes), and this makes it more likely to
hit similar issues. A bit like how doing many HTTP requests makes it
more likely to hit at least one 99th-percentile outlier.
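
To make the mechanism concrete, here is a minimal, self-contained sketch
of the batching idea (hypothetical names and layout, not the actual
shm_mq.c code): the writer only publishes its write position and wakes
the reader once roughly 1/4 of the ring has been buffered, unless a
flush is forced.

#include <stdbool.h>
#include <stddef.h>

typedef struct sketch_queue
{
    size_t      ring_size;      /* total size of the ring buffer */
    size_t      bytes_written;  /* write position visible to the reader */
    size_t      pending;        /* bytes copied in but not yet published */
} sketch_queue;

/* Stand-in for setting the receiver's latch. */
static void
sketch_wake_reader(sketch_queue *q)
{
    (void) q;
}

static void
sketch_publish(sketch_queue *q, size_t nbytes, bool force_flush)
{
    q->pending += nbytes;

    /* Batch updates: publish only after ~1/4 of the ring, or when forced. */
    if (force_flush || q->pending > q->ring_size / 4)
    {
        q->bytes_written += q->pending;
        q->pending = 0;
        sketch_wake_reader(q);
    }
}

With a small LIMIT, the rows the leader needs may sit in "pending" for a
long time before the threshold is reached, which matches the behavior
described above.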

> ...
>
> I had a quick look at the code and I started wondering if we can't
> achieve the same performance improvement without batching by e.g.:
>
> - Only set the latch if new data is written to an empty queue.
> Otherwise, the leader should anyway keep trying to read from the queues
> without waiting for the latch, so there's no need to set the latch again.
>
> - Reorganize struct shm_mq. There seems to be false sharing happening
> between at least mq_ring_size and the atomics, and potentially also
> between the atomics themselves. I'm wondering whether that might be the
> root cause of the "slow atomics" observed in [1]? I'm happy to do some
> profiling.
>
> Alternatively, we could always set the latch if numberTuples in
> ExecutePlan() is reasonably low. To do so, the DestReceiver's receive()
> method would only need an additional "force flush" argument.
>

No opinion on these options, but worth a try. Alternatively, we could
try the usual doubling approach - start with a low threshold (and set
the latch frequently), and then gradually increase it up to 1/4 of the
queue size.

That should work both for queries expecting only a few rows and for
those producing a lot of data.
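
As a rough sketch of that doubling approach (building on the simplified
queue sketched earlier, hypothetical code, not a patch):

static void
sketch_publish_adaptive(sketch_queue *q, size_t nbytes,
                        size_t *threshold, bool force_flush)
{
    q->pending += nbytes;

    if (force_flush || q->pending > *threshold)
    {
        q->bytes_written += q->pending;
        q->pending = 0;
        sketch_wake_reader(q);

        /* Double the threshold after each flush, capped at 1/4 of the ring. */
        if (*threshold < q->ring_size / 4)
        {
            *threshold *= 2;
            if (*threshold > q->ring_size / 4)
                *threshold = q->ring_size / 4;
        }
    }
}

Early flushes would be frequent, so a LIMIT query gets its first rows
quickly, while a worker streaming a lot of data converges to the current
1/4 behavior.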

>
> A slightly different but related problem is when some workers have
> already produced enough rows to answer the LIMIT query, but other
> workers are still running without producing any new rows. In that case
> the "already done" workers will stop running even though they haven't
> reached 1/4th of the queue size, because the for-loop in execMain.c
> bails out in the following condition:
>
>         if (numberTuples && numberTuples == current_tuple_count)
>             break;
>
> Subsequently, the leader will end the plan and then wait in the Gather
> node for all workers to shut down. However, workers still running but not
> producing any new rows will never reach the following condition in
> execMain.c to check if they're supposed to stop (the shared memory queue
> dest receiver will return false on detached queues):
>
>             /*
>              * If we are not able to send the tuple, we assume the destination
>              * has closed and no more tuples can be sent. If that's the case,
>              * end the loop.
>              */
>             if (!dest->receiveSlot(slot, dest))
>                 break;
>
> Reproduction steps for this problem are below. Here, the worker that
> gets the first table page will be done right away, but the query still
> takes as long as scanning all pages of the entire table.
>

Ouch!

> ...
>
> We would need something similar to CHECK_FOR_INTERRUPTS() which returns
> a NULL slot if a parallel worker is supposed to stop execution (we could
> e.g. check if the queue got detached). Or could we amend
> CHECK_FOR_INTERRUPTS() to just stop the worker gracefully if the queue
> got detached?
>

That sounds reasonable, but I'm not very familiar with the leader-worker
communication, so no opinion on how it should be done.
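
For illustration only (hypothetical names, not the actual executor
code), the idea as I read it is a per-tuple loop that also bails out
once the output queue got detached, so a worker producing no more
matching rows still notices the shutdown:

#include <stdbool.h>
#include <stddef.h>

typedef struct sketch_slot sketch_slot;     /* stand-in for TupleTableSlot */

typedef struct sketch_dest
{
    bool    (*receiveSlot) (sketch_slot *slot, struct sketch_dest *self);
    bool    (*queueDetached) (struct sketch_dest *self);   /* hypothetical */
} sketch_dest;

static void
sketch_worker_loop(sketch_dest *dest, sketch_slot *(*next_tuple) (void))
{
    for (;;)
    {
        sketch_slot *slot;

        /* Bail out if the leader has already detached our queue. */
        if (dest->queueDetached(dest))
            break;

        slot = next_tuple();
        if (slot == NULL)
            break;              /* plan exhausted */

        /* Destination closed: no more tuples can be sent. */
        if (!dest->receiveSlot(slot, dest))
            break;
    }
}

Whether such a check belongs in the executor loop, in the dest receiver,
or in CHECK_FOR_INTERRUPTS() is exactly the open question, of course.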

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
