Re: Trouble with hashagg spill I/O pattern and costing

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Trouble with hashagg spill I/O pattern and costing
Date: 2020-05-25 12:17:22
Message-ID: 20200525121722.6wzcelweefyuwg54@development
Lists: pgsql-hackers

On Mon, May 25, 2020 at 04:10:45AM +0200, Tomas Vondra wrote:
>
> ...
>
>parallel queries
>================
>
>And now the fun begins ...
>
>
>1) small one (SSD, max_parallel_workers_per_gather = 2)
>
> algorithm master tlist prealloc prealloc+tlist
> --------------------------------------------------
> hash 693 390 177 128
> sort 103 99 101 99
>
>This looks pretty nice - the patches have the expected effect, it got
>faster than with just a single CPU etc.
>
>
>2) big one (SATA, max_parallel_workers_per_gather = 16)
>
> algorithm master tlist prealloc prealloc+tlist
> --------------------------------------------------
> hash ? 25000 ? 3132
> sort 248 234 216 200
>
>Well, not that nice :-( The hash queries take so much time that I've
>decided not to wait for them and the two numbers are actually just
>estimates (after processing just a couple of logical tapes).
>
>Plus it actually gets slower than with serial execution, so what's the
>problem here? Especially considering it worked OK on the small machine?
>
>At first I thought it's something about SSD vs. SATA, but it seems to be
>more about how we construct the plans, because the plans between the two
>machines are very different. And it seems to depend on the number of
>workers per gather - for low number of workers the plan looks like this
>(the plans are attached in plans.txt in case the formatting gets broken
>by your client):
>
>
> QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------
> Limit
> -> Aggregate
> -> Hash Join
> Hash Cond: (part.p_partkey = lineitem_1.l_partkey)
> Join Filter: (lineitem.l_quantity < ((0.2 * avg(lineitem_1.l_quantity))))
> -> Gather
> Workers Planned: 2
> -> Nested Loop
> -> Parallel Seq Scan on part
> Filter: ((p_brand = 'Brand#22'::bpchar) AND (p_container = 'LG BOX'::bpchar))
> -> Index Scan using idx_lineitem_part_supp on lineitem
> Index Cond: (l_partkey = part.p_partkey)
> -> Hash
> -> Finalize HashAggregate
> Group Key: lineitem_1.l_partkey
> -> Gather
> Workers Planned: 2
> -> Partial HashAggregate
> Group Key: lineitem_1.l_partkey
> -> Parallel Seq Scan on lineitem lineitem_1
> (20 rows)
>
>but then if I crank the number of workers up, it switches to this:
>
> QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------
> Limit
> -> Finalize Aggregate
> -> Gather
> Workers Planned: 5
> -> Partial Aggregate
> -> Nested Loop
> Join Filter: (part.p_partkey = lineitem.l_partkey)
> -> Hash Join
> Hash Cond: (part.p_partkey = lineitem_1.l_partkey)
> -> Parallel Seq Scan on part
> Filter: ((p_brand = 'Brand#22'::bpchar) AND (p_container = 'LG BOX'::bpchar))
> -> Hash
> -> HashAggregate
> Group Key: lineitem_1.l_partkey
> -> Seq Scan on lineitem lineitem_1
> -> Index Scan using idx_lineitem_part_supp on lineitem
> Index Cond: (l_partkey = lineitem_1.l_partkey)
> Filter: (l_quantity < ((0.2 * avg(lineitem_1.l_quantity))))
> (18 rows)
>
>
>Notice that in the first plan, the hashagg is on top of parallel-aware
>path - so each worker builds a hashagg only on a subset of the data, and
>also spills only a fraction of the input rows (so that all workers
>combined spill roughly the "whole" table).
>

OK, I've done an experiment and re-ran the test with

max_parallel_workers_per_gather = 5

which is the highest value still giving the "good" plan, and the results
look like this:

   algorithm    master    tlist    prealloc    prealloc+tlist
   ----------------------------------------------------------
   hash          10535     1044        1723               407
   sort            198      196         192               219

which is obviously *way* better than the numbers with more workers:

> algorithm master tlist prealloc prealloc+tlist
> --------------------------------------------------
> hash ? 25000 ? 3132
> sort 248 234 216 200

It's still ~2x slower than the sort, so presumably we'll need to tweak
the costing somehow. I do believe this is still due to differences in I/O
patterns, with parallel hashagg probably being a bit more random (I'm
deducing that from the SSD not being affected by this).

I'd imagine this is because, given the same work_mem value, sort creates
sorted runs that are then merged into larger ones, which keeps the I/O
mostly sequential. OTOH hashagg with a small work_mem needs more batches,
and writing to many batches at once makes the I/O more interleaved /
random.
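To illustrate the intuition (this is just a toy model of the write
patterns, not the actual tuplesort/hashagg spill code, and the block
counts are made up), one can compare a spill that writes whole runs one
after another against one that round-robins small writes across many
batch "tapes", counting non-adjacent consecutive writes as a crude proxy
for random I/O:

```python
# Toy model of on-disk write patterns (not PostgreSQL's actual code):
# a sort spills whole sorted runs one after another, while a hashagg
# spilling to N batches interleaves small writes across N regions.
# We count how often consecutive writes land on non-adjacent blocks,
# as a crude proxy for random I/O.

def count_jumps(writes):
    """Count writes that are not adjacent to the previous one."""
    return sum(1 for prev, cur in zip(writes, writes[1:]) if cur != prev + 1)

def sort_pattern(n_blocks, n_runs):
    # Each run is written out sequentially before the next one starts,
    # so the overall write stream stays sequential.
    run_len = n_blocks // n_runs
    writes = []
    for r in range(n_runs):
        writes.extend(range(r * run_len, (r + 1) * run_len))
    return writes

def hashagg_pattern(n_blocks, n_batches):
    # Rows hash to batches in arbitrary order, so writes alternate
    # between per-batch regions of the spill space.
    batch_len = n_blocks // n_batches
    writes = []
    for i in range(batch_len):
        for b in range(n_batches):
            writes.append(b * batch_len + i)
    return writes

if __name__ == "__main__":
    blocks = 4096
    print("sort (8 runs), jumps:       ", count_jumps(sort_pattern(blocks, 8)))
    print("hashagg (64 batches), jumps:", count_jumps(hashagg_pattern(blocks, 64)))
```

In this toy the sort-style spill comes out fully sequential while the
interleaved batch writes make essentially every write a seek, which is
exactly the kind of difference a SATA array punishes and an SSD shrugs
off.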

This does not explain why we end up with the "bad" plans, though.

Attached are two files showing how the plan changes for different
numbers of workers per gather, both for groupagg and hashagg. For
groupagg the plan shape does not change at all; for hashagg it starts as
the "good" one and then, between 5 and 6 workers, switches to the "bad"
one.

There's another interesting thing I just noticed - as we increase the
number of workers, the cost estimate actually starts to grow at some
point:

workers | plan cost
0 | 23594267
1 | 20155545
2 | 19785306
5 | 22718176 <-
6 | 23063639
10 | 22990594
12 | 22972363

AFAIK this happens because we pick the number of workers simply based on
the size of the input relation, which ignores the cost of sending data
from the workers to the leader (parallel_tuple_cost). That cost is quite
significant here, because each worker produces a large number of groups.
I don't think this is causing the issue, though, because the sort plans
behave the same way. (I wonder if we could/should consider a different
number of workers, somehow.)
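A crude model shows why the estimate can grow with the worker count
(made-up constants, not PostgreSQL's actual cost formulas): the scan
work shrinks as roughly 1/workers, but with a partial aggregate each
process ships its own copy of (almost) every group to the leader, so the
parallel_tuple_cost term grows with the number of processes:

```python
# Toy cost model (made-up numbers, not PostgreSQL's real formulas)
# for a Gather over a partial aggregate with many groups: the scan
# cost is divided among the processes, but each process emits ~all
# groups, so the tuple-transfer term grows with the worker count and
# eventually dominates, producing a cost minimum at a low worker count.

PARALLEL_TUPLE_COST = 0.1    # default value of the GUC
PARALLEL_SETUP_COST = 1000.0

def partial_agg_cost(scan_cost, ngroups, workers):
    if workers == 0:
        return scan_cost
    capacity = workers + 0.3          # leader participates a little
    per_process = scan_cost / capacity
    # Each of the ~capacity processes emits ~ngroups partial groups.
    transfer = ngroups * capacity * PARALLEL_TUPLE_COST
    return PARALLEL_SETUP_COST + per_process + transfer

if __name__ == "__main__":
    scan, groups = 1_000_000.0, 1_000_000
    for w in (0, 1, 2, 5, 6, 10, 12):
        print(f"{w:2d} workers -> cost {partial_agg_cost(scan, groups, w):12.0f}")
```

With these (arbitrary) numbers the model reproduces the same shape as
the table above: the cost dips at a couple of workers and then climbs
again as the transfer term takes over.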

We probably can't see these plans on PostgreSQL 12 simply because
hashagg would need more memory than work_mem (especially in parallel
mode), so we simply reject them.
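Roughly speaking (this is a simplified sketch of that planner gate, with
an assumed per-entry overhead, not the exact accounting PostgreSQL
uses), hashed aggregation was only considered when the estimated hash
table fit in work_mem, so these huge-group plans never got built:

```python
# Simplified sketch of the pre-v13 planner gate (the 56-byte per-entry
# overhead is an assumption for illustration, not the exact value):
# hashed aggregation is considered only if the estimated hash table
# fits in work_mem; otherwise the planner falls back to sorting.

WORK_MEM_KB = 4 * 1024          # work_mem = 4MB

def hashagg_allowed(ngroups, entry_width_bytes, overhead_bytes=56):
    """Return True if an ngroups-entry hash table should fit in work_mem."""
    est_bytes = ngroups * (entry_width_bytes + overhead_bytes)
    return est_bytes <= WORK_MEM_KB * 1024

if __name__ == "__main__":
    # ~200k distinct groups of ~16 bytes each does not fit in 4MB ...
    print(hashagg_allowed(200_000, 16))   # -> False
    # ... but ~40k groups does.
    print(hashagg_allowed(40_000, 16))    # -> True
```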

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
large-plans-sort.txt text/plain 8.3 KB
large-plans-hash.txt text/plain 14.1 KB
