Performance issues with parallelism and LIMIT

From: David Geier <geidav(dot)pg(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, dilipbalaut(at)gmail(dot)com
Subject: Performance issues with parallelism and LIMIT
Date: 2023-02-01 13:41:02
Message-ID: 2c4686b2-635b-8cb8-8f22-03e1fa336f14@gmail.com
Lists: pgsql-hackers

Hi hackers,

While migrating from PostgreSQL 14 to 15, we encountered the following
performance degradation caused by commit 46846433a03dff: "shm_mq: Update
mq_bytes_written less often", discussion in [1].

The batching can make queries with a LIMIT clause run significantly
slower than on PostgreSQL 14, because neither the ring buffer write
position is updated nor the latch that informs the leader of available
data is set before a worker's queue is 1/4 full. This can be seen in
the number of rows produced by each parallel worker. The worst case is
a large data set where all rows needed to answer the query appear early
but are not big enough to fill the queue to 1/4 (e.g. when both the
LIMIT and the tuple sizes are small). Here is an example that
reproduces the problem.

CREATE TABLE t(id1 INT, id2 INT, id3 INT, id4 INT, id5 INT);
INSERT INTO t(id1, id2, id3, id4, id5) SELECT i%1000, i, i, i, i FROM
generate_series(1, 10000000) AS i;
ANALYZE t;
SET parallel_tuple_cost = 0;
SET parallel_setup_cost = 0;
SET min_parallel_table_scan_size = 0;
SET max_parallel_workers_per_gather = 8;
EXPLAIN ANALYZE VERBOSE SELECT id2 FROM t WHERE id1 = 100 LIMIT 100;

PostgreSQL 15:

 Limit  (cost=0.00..797.43 rows=100 width=4) (actual
time=65.083..69.207 rows=100 loops=1)
   Output: id2
   ->  Gather  (cost=0.00..79320.18 rows=9947 width=4) (actual
time=65.073..68.417 rows=100 loops=1)
         Output: id2
         Workers Planned: 8
         Workers Launched: 7
         ->  Parallel Seq Scan on public.t (cost=0.00..79320.18
rows=1243 width=4) (actual time=0.204..33.049 rows=100 loops=7)
               Output: id2
               Filter: (t.id1 = 100)
               Rows Removed by Filter: 99345
               Worker 0:  actual time=0.334..32.284 rows=100 loops=1
               Worker 1:  actual time=0.060..32.680 rows=100 loops=1
               Worker 2:  actual time=0.637..33.954 rows=98 loops=1
               Worker 3:  actual time=0.136..33.301 rows=100 loops=1
               Worker 4:  actual time=0.140..31.942 rows=100 loops=1
               Worker 5:  actual time=0.062..33.673 rows=100 loops=1
               Worker 6:  actual time=0.062..33.512 rows=100 loops=1
 Planning Time: 0.113 ms
 Execution Time: 69.772 ms

PostgreSQL 14:

 Limit  (cost=0.00..797.75 rows=100 width=4) (actual
time=30.602..38.459 rows=100 loops=1)
   Output: id2
   ->  Gather  (cost=0.00..79320.18 rows=9943 width=4) (actual
time=30.592..37.669 rows=100 loops=1)
         Output: id2
         Workers Planned: 8
         Workers Launched: 7
         ->  Parallel Seq Scan on public.t (cost=0.00..79320.18
rows=1243 width=4) (actual time=0.221..5.181 rows=15 loops=7)
               Output: id2
               Filter: (t.id1 = 100)
               Rows Removed by Filter: 15241
               Worker 0:  actual time=0.129..4.840 rows=15 loops=1
               Worker 1:  actual time=0.125..4.924 rows=15 loops=1
               Worker 2:  actual time=0.314..5.249 rows=17 loops=1
               Worker 3:  actual time=0.252..5.341 rows=15 loops=1
               Worker 4:  actual time=0.163..5.179 rows=15 loops=1
               Worker 5:  actual time=0.422..5.248 rows=15 loops=1
               Worker 6:  actual time=0.139..5.489 rows=16 loops=1
 Planning Time: 0.084 ms
 Execution Time: 38.880 ms
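The per-worker row counts above are consistent with the 1/4 threshold never being hit. A back-of-envelope check (the queue size and per-tuple cost below are my assumptions for illustration, not exact PostgreSQL numbers):

```c
/* Back-of-envelope: with an (assumed) 64kB parallel tuple queue, the
 * 1/4 flush threshold is 16kB. A single-int4 row costs only a few
 * dozen bytes in the queue, so the ~100 rows each worker produces for
 * this LIMIT stay far below the threshold and never trigger an eager
 * write-position update or latch set. */
enum
{
    QUEUE_SIZE = 65536,                 /* assumed parallel tuple queue size */
    FLUSH_THRESHOLD = QUEUE_SIZE / 4,   /* = 16384 bytes */
    TUPLE_BYTES = 32,                   /* rough guess per queued int4 row */
    LIMIT_ROWS = 100                    /* the query's LIMIT */
};
```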

I had a quick look at the code and started wondering whether we could
achieve the same performance improvement without batching, e.g. by:

- Only setting the latch if new data is written to an empty queue.
Otherwise, the leader should anyway keep trying to read from the queues
without waiting for the latch, so there is no need to set the latch
again.

- Reorganizing struct shm_mq. There seems to be false sharing happening
between at least mq_ring_size and the atomics, and potentially also
between the atomics themselves. I'm wondering whether that's the root
cause of the "slow atomics" observed in [1]? I'm happy to do some
profiling.
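To make both ideas concrete, here's a toy sketch (invented names, plain C stand-ins for the latch and the atomics; this is not the actual shm_mq code): the counters are padded onto separate cache lines, and the leader is only woken on the empty-to-non-empty transition:

```c
#include <stdalign.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64

/* Toy queue: the two counters live on separate cache lines, so the
 * receiver's and sender's updates don't invalidate each other's line
 * (plain uint64_t here; the real code uses pg_atomic_uint64). */
typedef struct
{
    size_t ring_size;                           /* read-mostly after setup */
    alignas(CACHELINE) uint64_t bytes_read;     /* written by the receiver */
    alignas(CACHELINE) uint64_t bytes_written;  /* written by the sender */
    bool latch_set;                             /* stand-in for SetLatch() */
} toy_mq;

/* Update the write position on every send, but only wake the leader
 * when the queue goes from empty to non-empty: if data was already
 * pending, the leader keeps reading without waiting on its latch, so
 * setting it again buys nothing. */
void
toy_mq_send(toy_mq *mq, uint64_t nbytes)
{
    bool was_empty = (mq->bytes_written == mq->bytes_read);

    mq->bytes_written += nbytes;
    if (was_empty)
        mq->latch_set = true;
}
```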

Alternatively, we could always set the latch if numberTuples in
ExecutePlan() is reasonably low. To do so, the DestReceiver's receive()
method would only need an additional "force flush" argument.
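A sketch of what that could look like on the receiver side (toy code with invented names; the "force flush" flag is the hypothetical extra argument, not part of the real DestReceiver API):

```c
#include <stdbool.h>

/* Toy buffered receiver: rows are normally pushed to the consumer only
 * once a batch threshold is reached; the hypothetical force_flush flag
 * (the proposed extra argument) flushes them immediately, e.g. when
 * ExecutePlan() knows numberTuples is small. */
typedef struct
{
    int buffered;   /* rows held back, waiting for a full batch */
    int flushed;    /* rows already pushed to the consumer */
    int threshold;  /* batch size that triggers a normal flush */
} toy_receiver;

bool
toy_receive(toy_receiver *r, bool force_flush)
{
    r->buffered++;
    if (force_flush || r->buffered >= r->threshold)
    {
        r->flushed += r->buffered;
        r->buffered = 0;
    }
    return true;    /* destination still attached */
}
```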

A slightly different but related problem arises when some workers have
already produced enough rows to answer the LIMIT query while other
workers are still running without producing any new rows. In that case
the "already done" workers stop running even though they haven't
reached 1/4 of the queue size, because the for-loop in execMain.c
bails out on the following condition:

        if (numberTuples && numberTuples == current_tuple_count)
            break;

Subsequently, the leader ends the plan and then waits in the Gather
node for all workers to shut down. However, workers that are still
running but not producing any new rows will never reach the following
check in execMain.c that tells them to stop (the shared memory queue
destination receiver returns false on detached queues):

            /*
             * If we are not able to send the tuple, we assume the destination
             * has closed and no more tuples can be sent. If that's the case,
             * end the loop.
             */
            if (!dest->receiveSlot(slot, dest))
                break;

Reproduction steps for this problem are below. Here the worker that
gets the first table page is done right away, yet the query takes as
long as scanning all pages of the entire table.

CREATE TABLE bar (col INT);
INSERT INTO bar SELECT generate_series(1, 5000000);
SET max_parallel_workers_per_gather = 8;
EXPLAIN ANALYZE VERBOSE SELECT col FROM bar WHERE col = 1 LIMIT 1;

 Limit  (cost=0.00..1.10 rows=1 width=4) (actual time=32.289..196.200
rows=1 loops=1)
   Output: col
   ->  Gather  (cost=0.00..30939.03 rows=28208 width=4) (actual
time=32.278..196.176 rows=1 loops=1)
         Output: col
         Workers Planned: 8
         Workers Launched: 7
         ->  Parallel Seq Scan on public.bar (cost=0.00..30939.03
rows=3526 width=4) (actual time=137.251..137.255 rows=0 loops=7)
               Output: col
               Filter: (bar.col = 1)
               Rows Removed by Filter: 713769
               Worker 0:  actual time=160.177..160.181 rows=0 loops=1
               Worker 1:  actual time=160.111..160.115 rows=0 loops=1
               Worker 2:  actual time=0.043..0.047 rows=1 loops=1
               Worker 3:  actual time=160.040..160.044 rows=0 loops=1
               Worker 4:  actual time=160.167..160.171 rows=0 loops=1
               Worker 5:  actual time=160.018..160.022 rows=0 loops=1
               Worker 6:  actual time=160.201..160.205 rows=0 loops=1
 Planning Time: 0.087 ms
 Execution Time: 196.247 ms

We would need something similar to CHECK_FOR_INTERRUPTS() that returns
a NULL slot if a parallel worker is supposed to stop execution (we
could, e.g., check if the queue got detached). Or could we amend
CHECK_FOR_INTERRUPTS() itself to stop the worker gracefully if the
queue got detached?
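As a sketch of the first variant (all names invented; a real implementation would check something like whether the worker's output shm_mq got detached):

```c
#include <stdbool.h>

/* Toy stand-ins: the queue's detached flag models the leader having
 * detached from the worker's output queue. */
typedef struct
{
    bool detached;
} toy_queue;

/* Stand-in for something like a shm_mq "is detached" check on the
 * worker's own output queue. */
bool
worker_should_stop(const toy_queue *q)
{
    return q->detached;
}

/* Sketch of an executor loop amended with the proposed check: bail out
 * as soon as the leader has detached, instead of scanning the rest of
 * the table without ever sending a tuple. */
int
toy_exec_loop(toy_queue *q, int total_rows, int detach_after)
{
    int produced = 0;

    for (int i = 0; i < total_rows; i++)
    {
        if (i == detach_after)
            q->detached = true;     /* simulate the leader detaching */
        if (worker_should_stop(q))
            break;                  /* analogous to returning a NULL slot */
        produced++;
    }
    return produced;
}
```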

Jasper Smit, Spiros Agathos and Dimos Stamatakis helped with this work.

[1]
https://www.postgresql.org/message-id/flat/CAFiTN-tVXqn_OG7tHNeSkBbN%2BiiCZTiQ83uakax43y1sQb2OBA%40mail.gmail.com

--
David Geier
(ServiceNow)
