Gather Merge

From: Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Gather Merge
Date: 2016-10-05 06:05:45
Message-ID: CAGPqQf09oPX-cQRpBKS0Gq49Z+m6KBxgxd_p9gX8CKk_d75HoQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi hackers,

Attached is the patch to implement Gather Merge. The Gather Merge node would
assume that the results from each worker are ordered with respect to each
other,
and then do a final merge pass over those. This is so that we get the
top-level
query ordering we want. The final plan for such a query would look something
like this:

Gather Merge
-> Sort
-> Parallel Seq Scan on foo
Filter: something

With this we now have a new parallel node which will always return the
sorted
results, so that any query having the pathkey can build the gather merge
path.
Currently if a query has a pathkey and we want to make it parallel-aware,
the
plan would be something like this:

Sort
-> Gather
-> Parallel Seq Scan on foo
Filter: something

With GatherMerge now it is also possible to have plan like:

Finalize GroupAggregate
-> Gather Merge
-> Partial GroupAggregate
-> Sort

With gather merge, sort can be pushed under the Gather Merge. It's valuable
as it has very good performance benefits. Consider the following test
results:

1) ./db/bin/pgbench postgres -i -F 100 -s 20
2) update pgbench_accounts set filler = 'foo' where aid%10 = 0;
3) vacuum analyze pgbench_accounts;
4) set max_parallel_workers_per_gather = 4;

Without patch:

postgres=# explain analyze select aid, sum(abalance) from pgbench_accounts
where filler like '%foo%' group by aid;
QUERY
PLAN
----------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=81696.51..85242.09 rows=202605 width=12) (actual
time=1037.212..1162.086 rows=200000 loops=1)
Group Key: aid
-> Sort (cost=81696.51..82203.02 rows=202605 width=8) (actual
time=1037.203..1072.446 rows=200000 loops=1)
Sort Key: aid
Sort Method: external sort Disk: 3520kB
-> Seq Scan on pgbench_accounts (cost=0.00..61066.59 rows=202605
width=8) (actual time=801.398..868.390 rows=200000 loops=1)
Filter: (filler ~~ '%foo%'::text)
Rows Removed by Filter: 1800000
Planning time: 0.133 ms
Execution time: 1171.656 ms
(10 rows)

With Patch:

postgres=# explain analyze select aid, sum(abalance) from pgbench_accounts
where filler like '%foo%' group by aid;

QUERY
PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize GroupAggregate (cost=47274.13..56644.58 rows=202605 width=12)
(actual time=315.457..561.825 rows=200000 loops=1)
Group Key: aid
-> Gather Merge (cost=47274.13..54365.27 rows=50651 width=0) (actual
time=315.451..451.886 rows=200000 loops=1)
Workers Planned: 4
Workers Launched: 4
-> Partial GroupAggregate (cost=46274.09..47160.49 rows=50651
width=12) (actual time=306.830..333.908 rows=40000 loops=5)
Group Key: aid
-> Sort (cost=46274.09..46400.72 rows=50651 width=8)
(actual time=306.822..310.800 rows=40000 loops=5)
Sort Key: aid
Sort Method: quicksort Memory: 2543kB
-> Parallel Seq Scan on pgbench_accounts
(cost=0.00..42316.15 rows=50651 width=8) (actual time=237.552..255.968
rows=40000 loops=5)
Filter: (filler ~~ '%foo%'::text)
Rows Removed by Filter: 360000
Planning time: 0.200 ms
Execution time: 572.221 ms
(15 rows)

I ran the TPCH benchmark queries with the patch and found that 7 out of 22
queries end up picking the Gather Merge path.

Below benchmark numbers taken under following configuration:

- Scale factor = 10
- max_worker_processes = DEFAULT (8)
- max_parallel_workers_per_gather = 4
- Cold cache environment is ensured. With every query execution - server is
stopped and also OS caches were dropped.
- The reported values of execution time (in ms) is median of 3 executions.
- power2 machine with 512GB of RAM
- PFA performance machine cpu into (benchmark_machine_info.txt)

Query 4: With GM 7901.480 -> Without GM 9064.776
Query 5: With GM 53452.126 -> Without GM 55059.511
Query 9: With GM 52613.132 -> Without GM 98206.793
Query 15: With GM 68051.058 -> Without GM 68918.378
Query 17: With GM 129236.075 -> Without GM 160451.094
Query 20: With GM 259144.232 -> Without GM 306256.322
Query 21: With GM 153483.497 -> Without GM 168169.916

Here from the results we can see that query 9, 17 and 20 are the one which
show good performance benefit with the Gather Merge.

PFA tpch_result.tar.gz for the explain analysis output for TPCH queries
(with
and without patch)

I ran the TPCH benchmark queries with different number of workers and found
that
Query 18 also started picking Gather merge with worker > 6. PFA attach
TPCH_GatherMerge.pdf for the detail benchmark results.

Implementation details:

New Gather Merge node:

The patch introduces a new node type for Gather Merge. The Gather Merge
implementation is mostly similar to what Gather does. The major difference
is
that the Gather node has two TupleTableSlots; one for leader and one for the
tuple read from the worker, and Gather Merge has a TupleTableSlot per
worker,
plus one for the leader. As for Gather Merge, we need to fill every slot,
then
build a heap of the tuples and return the lowest one.

The patch generates the gather merge path from:

a) create_ordered_paths() for partial_pathlist. If the pathkey contain the
sort_pathkey, then directly create the gather merge. If not then create sort
and then create the gather merge path.

Example:

explain analyze
select * from pgbench_accounts where filler like '%foo%' order by aid;

b) create_distinct_paths(): when sort-based implementations of DISTINCT is
possible.

Example:

explain analyze
select distinct * from pgbench_accounts where filler like '%foo%' order by
aid;

c) create_grouping_paths() : While generating a complete GroupAgg Path, loop
over the partial path list and if partial path contains the group_pathkeys
generate the gather merge path.

Example:

explain analyze
select * from pgbench_accounts where filler like '%foo%' group by aid;

In all the above mentioned cases, with the patches it's giving almost a 2x
performance gain. PFA pgbench_query.out, for the explain analyze output for
the
queries.

Gather Merge reads the tuple from each queue and then picks the lowest one,
so
every time it has to read the tuple from the queue into wait mode. During
testing I found that some of the query spent some time reading tuple
from the queue. So in the patch I introduced the tuple array; once we get
the tuple into wait mode, it tries to read more tuples in nowait mode and
store it into the tuple array. Once we get one tuple through the queue,
there
are chances to have more ready tuple into queue, so just read it and, if
any,
store it to the tuple array. With this I found good performance benefits
with
some of the TPC-H complex queries.

Costing:

GatherMerge merges several pre-sorted input streams using a heap.
Considering
that costing for Gather Merge is the combination of cost_gather +
cost_merge_append.

Open Issue:

- Commit af33039317ddc4a0e38a02e2255c2bf453115fd2 fixed the leak into
tqueue.c by
calling gather_readnext() into per-tuple context. Now for gather merge that
is
not possible, as we storing the tuple into Tuple array and we want tuple to
be
free only its get pass through the merge sort algorithm. One idea is, we can
also call gm_readnext_tuple() under per-tuple context (which will fix the
leak
into tqueue.c) and then store the copy of tuple into tuple array.
- Need to see a way to add the simple test as part of regression.

Thanks to my colleague Robert Haas for his help in design and some
preliminary review for the patch.

Please let me know your thought, and thanks for reading.

Regards,
Rushabh Lathia
www.EnterpriseDB.com

Attachment Content-Type Size
benchmark_machine_info.txt text/plain 607 bytes
TPCH_GatherMerge.pdf application/pdf 52.1 KB
tpch_results.tar.gz application/x-gzip 29.7 KB
gather_merge_v1.patch application/x-download 49.0 KB
pgbench_query.out application/octet-stream 10.0 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2016-10-05 06:18:16 strange case of kernel performance regression (4.3.x and newer)
Previous Message Magnus Hagander 2016-10-05 06:02:42 Re: Hash tables in dynamic shared memory