Re: Parallel bitmap index scan

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel bitmap index scan
Date: 2020-09-21 06:06:18
Message-ID: CAFiTN-t_23FjA9Oc8AjMXFoQTNWSopmxMYisWAgo=o1r-ZHiWA@mail.gmail.com
Lists: pgsql-hackers

On Sun, Jul 26, 2020 at 6:42 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> I would like to propose a patch for enabling the parallelism for the
> bitmap index scan path.
>
> Background:
> Currently, we support only a parallel bitmap heap scan path. There,
> the underlying bitmap index scan is done by a single worker, the
> leader. The leader creates a bitmap in shared memory; once the bitmap
> is ready it creates a shared iterator, and after that all the workers
> process the shared iterator and scan the heap in parallel. While
> analyzing TPCH plans we observed that some queries spend significant
> time preparing the bitmap, so the idea of this patch is to use a
> parallel index scan to prepare the underlying bitmap in parallel.
>
> Design:
> If the underlying index AM supports the parallel path (currently only
> BTREE supports it), then we will create a parallel bitmap heap scan
> path on top of the parallel bitmap index scan path. The idea of this
> patch is that each worker does a parallel index scan and generates its
> part of the bitmap. We create a barrier so that we cannot start
> preparing the shared iterator until all the workers are ready with
> their bitmaps. The first worker that is ready with its bitmap keeps a
> copy of its TBM and page table in shared memory, and all the
> subsequent workers merge their TBMs into that shared TBM. Once all the
> TBMs are merged we get one common shared TBM, and after that stage the
> workers can continue. The remaining part is the same: one worker scans
> the shared TBM and prepares the shared iterator, and once it is ready
> all the workers jointly scan the heap in parallel using the shared
> iterator.
>
> BitmapHeapNext
> {
>     ...
>     BarrierAttach();
>     tbm = MultiExecProcNode();
>     tbm_merge(tbm);            /* merge with the common TBM using tbm_union */
>     BarrierArriveAndWait();    /* wait until every worker has merged */
>
>     if (BitmapShouldInitializeSharedState(pstate))  /* only one worker comes out of this */
>     {
>         tbm_prepare_shared_iterate();
>         BitmapDoneInitializingSharedState();        /* wake up the others */
>     }
>     tbm_attach_shared_iterate();   /* all workers attach to the shared iterator */
>     ...
> }
>
> Performance: With scale factor 10, I could see that Q6 spends
> significant time in the bitmap index scan, so I measured performance
> with that query. The bitmap index scan node is 3x faster with 3
> workers, and the overall plan got ~40% faster.
>
> TPCH: S.F. 10, work_mem=512MB, shared_buffers=1GB
>
> HEAD:
> Limit  (cost=1559777.02..1559777.03 rows=1 width=32) (actual time=5260.121..5260.122 rows=1 loops=1)
>   ->  Finalize Aggregate  (cost=1559777.02..1559777.03 rows=1 width=32) (actual time=5260.119..5260.119 rows=1 loops=1)
>         ->  Gather  (cost=1559776.69..1559777.00 rows=3 width=32) (actual time=5257.251..5289.595 rows=4 loops=1)
>               Workers Planned: 3
>               Workers Launched: 3
>               ->  Partial Aggregate  (cost=1558776.69..1558776.70 rows=1 width=32) (actual time=5247.714..5247.714 rows=1 loops=4)
>                     ->  Parallel Bitmap Heap Scan on lineitem  (cost=300603.01..1556898.89 rows=375560 width=12) (actual time=3475.944..5037.484 rows=285808 loops=4)
>                           Recheck Cond: ((l_shipdate >= '1997-01-01'::date) AND (l_shipdate < '1998-01-01 00:00:00'::timestamp without time zone) AND (l_discount >= 0.02) AND (l_discount <= 0.04) AND (l_quantity < '24'::numeric))
>                           Heap Blocks: exact=205250
>                           ->  Bitmap Index Scan on idx_lineitem_shipdate  (cost=0.00..300311.95 rows=1164235 width=0) (actual time=3169.855..3169.855 rows=1143234 loops=1)
>                                 Index Cond: ((l_shipdate >= '1997-01-01'::date) AND (l_shipdate < '1998-01-01 00:00:00'::timestamp without time zone) AND (l_discount >= 0.02) AND (l_discount <= 0.04) AND (l_quantity < '24'::numeric))
> Planning Time: 0.659 ms
> Execution Time: 5289.787 ms
> (13 rows)
>
>
> PATCH:
>
> Limit  (cost=1559579.85..1559579.86 rows=1 width=32) (actual time=3333.572..3333.572 rows=1 loops=1)
>   ->  Finalize Aggregate  (cost=1559579.85..1559579.86 rows=1 width=32) (actual time=3333.569..3333.569 rows=1 loops=1)
>         ->  Gather  (cost=1559579.52..1559579.83 rows=3 width=32) (actual time=3328.619..3365.227 rows=4 loops=1)
>               Workers Planned: 3
>               Workers Launched: 3
>               ->  Partial Aggregate  (cost=1558579.52..1558579.53 rows=1 width=32) (actual time=3307.805..3307.805 rows=1 loops=4)
>                     ->  Parallel Bitmap Heap Scan on lineitem  (cost=300405.84..1556701.72 rows=375560 width=12) (actual time=1585.726..3097.628 rows=285808 loops=4)
>                           Recheck Cond: ((l_shipdate >= '1997-01-01'::date) AND (l_shipdate < '1998-01-01 00:00:00'::timestamp without time zone) AND (l_discount >= 0.02) AND (l_discount <= 0.04) AND (l_quantity < '24'::numeric))
>                           Heap Blocks: exact=184293
>                           ->  Parallel Bitmap Index Scan on idx_lineitem_shipdate  (cost=0.00..300311.95 rows=1164235 width=0) (actual time=1008.361..1008.361 rows=285808 loops=4)
>                                 Index Cond: ((l_shipdate >= '1997-01-01'::date) AND (l_shipdate < '1998-01-01 00:00:00'::timestamp without time zone) AND (l_discount >= 0.02) AND (l_discount <= 0.04) AND (l_quantity < '24'::numeric))
> Planning Time: 0.690 ms
> Execution Time: 3365.420 ms
>
> Note:
> - Currently, I have only parallelized the bitmap index path when a
> bitmap index scan sits directly under the bitmap heap scan. If there
> is a BitmapAnd or BitmapOr path, the underlying bitmap index scans are
> not parallelized. I think BitmapAnd and BitmapOr need a completely
> different design, something similar to what we do in parallel append,
> so I don't think they need to be covered by this patch.
>
> - POC patch is attached to discuss the idea. The patch still needs
> cleanup and testing.

I have rebased this patch on the current head. Apart from this, I
have also measured performance at a higher scale factor this time. At
a higher scale factor I can see the performance with the patch is
dropping: the underlying bitmap index scan node gets faster with
parallelism, but the overall performance goes down due to the TBM
merging in the parallel bitmap heap node. Currently, there is a lot
of scope for improving tbm_merge:
- Currently, whichever worker produces its TBM first becomes the host
TBM and all the other workers merge their TBMs into it. Ideally, the
largest TBM should become the host TBM.
- While merging we directly use tbm_union, which has to reinsert every
entry into the host TBM's hashtable. Instead of merging like this, we
could create just a shared iterator (and somehow remove the
duplicates) without really merging the hashtables. I haven't thought
this design through completely, but it seems doable; basically, the
TBM iterator array would then keep items from multiple TBM
hashtables.

max_parallel_workers_per_gather=4
work_mem=20GB
shared_buffers=20GB

HEAD
TPCH Query   Parallel BitmapHeap + BitmapIndex   BitmapIndex
 4            19764                                535
 5            12035                               1545
 6           119815                               7943
14            44154                               1007

PATCH
TPCH Query   Parallel BitmapHeap + Parallel BitmapIndex   Parallel BitmapIndex
 4            19765                                         287
 5            13799                                         848
 6           116204                                        3255
14            44078                                         416

So if we look at the performance results, in most of the queries the
time spent in the bitmap index scan is reduced by half or more, but
the total time spent in the bitmap heap scan is either not reduced
significantly or has increased.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
v4-0001-POC-Parallel-Bitmap-Index-Scan.patch application/octet-stream 39.2 KB
