Improving scalability of Parallel Bitmap Heap/Index Scan

From: David Geier <geidav(dot)pg(at)gmail(dot)com>
To: PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Improving scalability of Parallel Bitmap Heap/Index Scan
Date: 2022-07-14 10:13:38
Message-ID: CAPsAnrn5yWsoWs8GhqwbwAJx1SeLxLntV54Biq0Z-J_E86Fnng@mail.gmail.com
Lists: pgsql-hackers

Hi hackers,

While debugging some slow queries containing Bitmap Heap/Index Scans (in
short BHS / BIS), we observed a few issues regarding scalability:

1. The BIS only ever runs in a single process, even when the parent BHS is parallel. The first process to arrive at the BHS acts as leader and executes the BIS.
2. As long as execution is "exact" (TIDs are stored instead of page bits), the parallel BHS sorts all TIDs to ensure pages are accessed sequentially. This sort is also performed by just a single worker. Already with a few tens of thousands of pages to scan, the sort time can make up a significant portion of the total runtime. Large page counts and the need for parallelism are not uncommon for BHS, as one of its use cases is closing the gap between index and sequential scans. The BHS costing does not seem to account for this.
3. The BHS does not scale well with an increasing number of parallel workers, even when accounting for the sequential parts of execution. A perf profile shows that the TID list / bitmap iteration code heavily contends on a mutex that is taken for every single TID / page bit (see LWLockAcquire(&istate->lock, LW_EXCLUSIVE) in tidbitmap.c:1067); a simplified sketch of that pattern follows this list.
4. The EXPLAIN ANALYZE statistics of the parallel BHS do not include the statistics of the parallel workers. For example, the number of heap pages processed reflects only what the leader did. As with other parallel plan nodes, we should aggregate the statistics across workers.
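
To make (3) a bit more concrete, below is a heavily simplified sketch of the current iteration pattern. It is not the actual tidbitmap.c code; SharedIterState and its fields are invented purely for illustration.

/* Invented for this sketch; not the real TBMSharedIteratorState. */
typedef struct SharedIterState
{
    LWLock          lock;       /* protects the fields below */
    int             next_page;  /* next entry to hand out */
    int             npages;     /* total number of entries */
    PagetableEntry *entries;    /* sorted page entries */
} SharedIterState;

/*
 * Every worker calls this once per page entry, so the exclusive lock is
 * acquired and released for each of the millions of entries in the
 * shared bitmap.
 */
static PagetableEntry *
shared_iterate_one(SharedIterState *istate)
{
    PagetableEntry *entry = NULL;

    LWLockAcquire(&istate->lock, LW_EXCLUSIVE);
    if (istate->next_page < istate->npages)
        entry = &istate->entries[istate->next_page++];
    LWLockRelease(&istate->lock);

    return entry;               /* NULL means the bitmap is exhausted */
}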

The EXPLAIN ANALYZE output below shows (1) to (3) in action for different numbers of workers. I had to obfuscate the query slightly. The difference between the startup time of the BHS and the BIS is the time it takes to sort the TID list. The self time of the BHS is the time spent iterating the shared TID list and processing the heap pages. That part runs in parallel but does not scale.
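For example, with 2 workers the sort takes roughly 5912 ms - 3107 ms ≈ 2800 ms, close to a fifth of the 15322 ms total runtime, and it stays at that level regardless of the worker count.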

Workers | Total runtime | Startup time BIS | Startup time BHS | Self time BHS (excl. sorting)
--------|---------------|------------------|------------------|------------------------------
      2 |      15322 ms |          3107 ms |          5912 ms |                       9269 ms
      4 |      13277 ms |          3094 ms |          5869 ms |                       7260 ms
      8 |      14628 ms |          3106 ms |          5882 ms |                       8598 ms

None of this is really new and some of it is even documented. So what I am really wondering about is why things are the way they are and how hard they would be to change. I am especially curious about the following:

- What stops us from extending the BIS to run in parallel? Plain Parallel Index Scans are already supported, after all.
- What about reducing the sort time by, e.g.,
  - dividing the TIDs across workers, ending up with N independently sorted streams,
  - cooperatively sorting the TIDs with multiple workers, using barriers for synchronization,
  - optimizing the PagetableEntry data structure for size and using a faster sorting algorithm such as radix sort,
  - or a combination of the first three options?
- With separate TID lists per worker process, the iteration problem would be solved. Otherwise, we could
  - optimize the iteration code and thereby minimize the duration of the critical section,
  - have worker processes acquire chunks of TIDs / page bits to reduce locking (see the sketch after this list).
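
As a rough sketch of the last idea (reusing the invented SharedIterState from above; this is not actual PostgreSQL code, just an illustration), a worker could claim a whole batch of page entries per lock acquisition and then work through them privately:

/*
 * Each worker claims up to max_entries page entries while holding the lock
 * once, then processes them without further locking.
 */
static int
shared_iterate_chunk(SharedIterState *istate,
                     PagetableEntry **entries, int max_entries)
{
    int     nclaimed;

    LWLockAcquire(&istate->lock, LW_EXCLUSIVE);
    nclaimed = Min(max_entries, istate->npages - istate->next_page);
    for (int i = 0; i < nclaimed; i++)
        entries[i] = &istate->entries[istate->next_page + i];
    istate->next_page += nclaimed;
    LWLockRelease(&istate->lock);

    return nclaimed;            /* 0 means the bitmap is exhausted */
}

The chunk size would trade lock traffic against how evenly the remaining pages are distributed across the workers towards the end of the scan.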

Is there interest in patches improving on the above-mentioned shortcomings? If so, which options do you deem best?

--
David Geier
(ServiceNow)

-- 2 workers

Finalize Aggregate (actual time=15228.937..15321.356 rows=1 loops=1)
  Output: count(*)
  ->  Gather (actual time=15187.942..15321.345 rows=2 loops=1)
        Output: (PARTIAL count(*))
        Workers Planned: 2
        Workers Launched: 2
        ->  Partial Aggregate (actual time=15181.486..15181.488 rows=1 loops=2)
              Output: PARTIAL count(*)
              Worker 0: actual time=15181.364..15181.366 rows=1 loops=1
              Worker 1: actual time=15181.608..15181.610 rows=1 loops=1
              ->  Parallel Bitmap Heap Scan on foo (actual time=5912.731..15166.992 rows=269713 loops=2)
                    Filter: ...
                    Rows Removed by Filter: 4020149
                    Worker 0: actual time=5912.498..15166.936 rows=269305 loops=1
                    Worker 1: actual time=5912.963..15167.048 rows=270121 loops=1
                    ->  Bitmap Index Scan on foo_idx (actual time=3107.947..3107.948 rows=8579724 loops=1)
                          Index Cond: -
                          Worker 1: actual time=3107.947..3107.948 rows=8579724 loops=1
Planning Time: 0.167 ms
Execution Time: 15322.081 ms

-- 4 workers

Finalize Aggregate (actual time=13175.765..13276.415 rows=1 loops=1)
  Output: count(*)
  ->  Gather (actual time=13137.981..13276.403 rows=4 loops=1)
        Output: (PARTIAL count(*))
        Workers Planned: 4
        Workers Launched: 4
        ->  Partial Aggregate (actual time=13130.344..13130.346 rows=1 loops=4)
              Output: PARTIAL count(*)
              Worker 0: actual time=13129.363..13129.365 rows=1 loops=1
              Worker 1: actual time=13130.085..13130.087 rows=1 loops=1
              Worker 2: actual time=13130.634..13130.635 rows=1 loops=1
              Worker 3: actual time=13131.295..13131.298 rows=1 loops=1
              ->  Parallel Bitmap Heap Scan on foo (actual time=5870.026..13120.579 rows=134856 loops=4)
                    Filter: ...
                    Rows Removed by Filter: 2010074
                    Worker 0: actual time=5869.033..13120.453 rows=128270 loops=1
                    Worker 1: actual time=5869.698..13118.811 rows=135333 loops=1
                    Worker 2: actual time=5870.465..13121.189 rows=137695 loops=1
                    Worker 3: actual time=5870.907..13121.864 rows=138128 loops=1
                    ->  Bitmap Index Scan on foo_idx (actual time=3094.585..3094.586 rows=8579724 loops=1)
                          Index Cond: -
                          Worker 3: actual time=3094.585..3094.586 rows=8579724 loops=1
Planning Time: 0.146 ms
Execution Time: 13277.315 ms

-- 8 workers

Finalize Aggregate (actual time=14533.688..14627.962 rows=1 loops=1)
  Output: count(*)
  ->  Gather (actual time=14492.463..14627.950 rows=8 loops=1)
        Output: (PARTIAL count(*))
        Workers Planned: 8
        Workers Launched: 8
        ->  Partial Aggregate (actual time=14483.059..14483.061 rows=1 loops=8)
              Output: PARTIAL count(*)
              Worker 0: actual time=14480.058..14480.061 rows=1 loops=1
              Worker 1: actual time=14480.948..14480.950 rows=1 loops=1
              Worker 2: actual time=14481.668..14481.670 rows=1 loops=1
              Worker 3: actual time=14482.829..14482.832 rows=1 loops=1
              Worker 4: actual time=14483.695..14483.697 rows=1 loops=1
              Worker 5: actual time=14484.290..14484.293 rows=1 loops=1
              Worker 6: actual time=14485.166..14485.168 rows=1 loops=1
              Worker 7: actual time=14485.819..14485.821 rows=1 loops=1
              ->  Parallel Bitmap Heap Scan on foo (actual time=5886.191..14477.239 rows=67428 loops=8)
                    Filter: ...
                    Rows Removed by Filter: 1005037
                    Worker 0: actual time=5882.909..14474.627 rows=60325 loops=1
                    Worker 1: actual time=5883.788..14474.945 rows=69459 loops=1
                    Worker 2: actual time=5884.475..14475.735 rows=69686 loops=1
                    Worker 3: actual time=5886.149..14477.162 rows=64680 loops=1
                    Worker 4: actual time=5886.987..14477.653 rows=71034 loops=1
                    Worker 5: actual time=5887.347..14478.667 rows=65836 loops=1
                    Worker 6: actual time=5888.978..14479.239 rows=67755 loops=1
                    Worker 7: actual time=5888.896..14479.886 rows=70651 loops=1
                    ->  Bitmap Index Scan on foo_idx (actual time=3106.840..3106.840 rows=8579724 loops=1)
                          Index Cond: -
                          Worker 7: actual time=3106.840..3106.840 rows=8579724 loops=1
Planning Time: 0.150 ms
Execution Time: 14628.648 ms
