Re: Disk-based hash aggregate's cost model

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Disk-based hash aggregate's cost model
Date: 2020-08-30 15:03:15
Message-ID: 20200830150315.taldaoobpaiv3ot5@development
Lists: pgsql-hackers

On Sun, Aug 30, 2020 at 02:26:20AM +0200, Tomas Vondra wrote:
>On Fri, Aug 28, 2020 at 06:32:38PM -0700, Jeff Davis wrote:
>>On Thu, 2020-08-27 at 17:28 -0700, Peter Geoghegan wrote:
>>>We have a Postgres 13 open item for Disk-based hash aggregate, which
>>>is the only non-trivial open item. There is a general concern about
>>>how often we get disk-based hash aggregation when work_mem is
>>>particularly low, and recursion seems unavoidable. This is generally
>>>thought to be a problem in the costing.
>>
>>We discussed two approaches to tweaking the cost model:
>>
>>1. Penalize HashAgg disk costs by a constant amount. It seems to be
>>chosen a little too often, and we can reduce the number of plan
>>changes.
>>
>>2. Treat recursive HashAgg spilling skeptically, and heavily penalize
>>recursive spilling.
>>
>>The problem with approach #2 is that we have a default hash mem of 4MB,
>>and real systems have a lot more than that. In this scenario, recursive
>>spilling can beat Sort by a lot.
>>
>
>I think the main issue is that we're mostly speculating about what's
>wrong. I've shared some measurements and symptoms, and we've discussed
>what might be causing it, but I don't think we really know for sure.
>
>I really dislike (1) because it seems more like a "we don't know what's
>wrong, so we'll penalize hashagg" kind of solution. A much more
>principled solution would be to tweak the costing accordingly, not just
>by adding some constant. For (2) it really depends on whether recursive
>spilling really is the problem here. In the examples I shared, the
>number of partitions/batches was very different, but the query duration
>was mostly independent of it (almost constant).
>

I've decided to look at the costing a bit more closely today, to see
why hashagg is costed so much lower than sort/groupagg. I've used the
same 32GB dataset and query as in [1].

I've repeated the tests for all the work_mem values, and the numbers of
batches are much lower now, probably thanks to the HLL improvement:

    2MB      Planned:  64    Batches (old):  4160    Batches: 2977
    4MB      Planned: 128    Batches (old): 16512    Batches: 1569
    8MB      Planned: 256    Batches (old): 21488    Batches: 1289
    64MB     Planned:  32    Batches (old):  2720    Batches:  165
    256MB    Planned:   8    Batches (old):      8    Batches:   41

The impact on query duration seems pretty negligible, though.

The plans with work_mem=64MB look like this:

1) hashagg

                                                                QUERY PLAN
 ----------------------------------------------------------------------------------------------------------------------------------------
  Limit  (cost=11267293.86..11267293.86 rows=1 width=36) (actual time=213127.515..213127.517 rows=0 loops=1)
    ->  HashAggregate  (cost=10229061.10..11267293.86 rows=6718533 width=36) (actual time=100862.623..212948.642 rows=6400000 loops=1)
          Group Key: lineitem.l_partkey
          Planned Partitions: 32  Batches: 165  Memory Usage: 67857kB  Disk Usage: 6802432kB
          ->  Seq Scan on lineitem  (cost=0.00..5519288.36 rows=191990736 width=9) (actual time=0.418..20344.631 rows=192000551 loops=1)
  Planning Time: 0.064 ms
  Execution Time: 213441.986 ms
 (7 rows)

2) groupagg

                                                                   QUERY PLAN
 ----------------------------------------------------------------------------------------------------------------------------------------------
  Limit  (cost=36755617.81..36755617.81 rows=1 width=36) (actual time=180029.594..180029.595 rows=0 loops=1)
    ->  GroupAggregate  (cost=35214909.30..36755617.81 rows=6718533 width=36) (actual time=94017.788..179857.683 rows=6400000 loops=1)
          Group Key: lineitem.l_partkey
          ->  Sort  (cost=35214909.30..35694886.14 rows=191990736 width=9) (actual time=94017.750..151138.727 rows=192000551 loops=1)
                Sort Key: lineitem.l_partkey
                Sort Method: external merge  Disk: 3742208kB
                ->  Seq Scan on lineitem  (cost=0.00..5519288.36 rows=191990736 width=9) (actual time=0.008..26831.264 rows=192000551 loops=1)
  Planning Time: 0.063 ms
  Execution Time: 180209.580 ms
 (9 rows)

I still don't understand why the hashagg is costed so low compared to
the sort (only about 1/3 of it), because both cases use exactly the same
estimates internally. cost_tuplesort ends up with

npages = 937455
nruns = 114.435396
input_bytes = 7679629440
log_runs = 1.0

while cost_agg uses

pages_read = 937455
pages_written = 937455
relation_size = 7679629440;

yet we end up with a much lower estimate for hashagg. It does seem to
me, however, that this is mostly due to non-I/O costs, which
cost_tuplesort accounts for and cost_agg perhaps ignores. In
particular, most of the sort cost comes from this:

*startup_cost = comparison_cost * tuples * LOG2(tuples);

So I'm wondering whether the hashagg isn't ignoring similar non-I/O
costs for the spilling case. In particular, the initial section
computing startup_cost seems to ignore that we may need to do some of
the work repeatedly - for example, we'll repeat hash lookups for
spilled tuples, and so on.
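
Just to put rough numbers on this, here's a trivial standalone sketch
of the two dominant terms (assuming the default cpu_operator_cost =
0.0025 and seq_page_cost = 1.0, and simplifying the spill charge to one
sequential write plus one read per page - the exact formula used by
cost_agg may well differ):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        /* figures from the plans / costing above */
        double tuples = 191990736.0;   /* input rows */
        double pages  = 937455.0;      /* npages / pages_read / pages_written */

        /* assumed defaults, not taken from the actual cluster config */
        double cpu_operator_cost = 0.0025;
        double seq_page_cost = 1.0;

        /* cost_tuplesort: the O(N log N) comparison term */
        double comparison_cost = 2.0 * cpu_operator_cost;
        double sort_cpu = comparison_cost * tuples * log2(tuples);

        /* simplified spill I/O charge: write and read each page once */
        double hash_io = seq_page_cost * (pages + pages);

        printf("sort comparisons: %.0f\n", sort_cpu);   /* ~26.4 million */
        printf("hashagg spill:    %.0f\n", hash_io);    /* ~1.9 million */
        return 0;
    }

The comparison term alone is ~26M, which is roughly the size of the gap
between the two total costs above, while the simplified spill charge is
only ~1.9M - so if the spilling hashagg has no comparable per-tuple CPU
term, the ~1/3 ratio isn't that surprising.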

The other thing is that the sort seems to be doing only about half the
physical I/O (as measured by iosnoop) compared to the hashagg, even
though the estimates of pages / input_bytes are exactly the same. For
the hashagg, iosnoop shows 5921MB of reads and 7185MB of writes, while
the sort only does 2895MB of reads and 3655MB of writes. That roughly
matches the observed sizes of the temp files in the two cases, so
input_bytes seems to be a bit overestimated for the sort.
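
For reference, converting the sizes quoted above into the same units (a
trivial calculation, using only the numbers already shown):

    #include <stdio.h>

    int main(void)
    {
        double input_bytes  = 7679629440.0;   /* planner estimate, both paths */
        double sort_temp_kb = 3742208.0;      /* "external merge  Disk" */
        double hash_temp_kb = 6802432.0;      /* hashagg "Disk Usage" */

        printf("input_bytes:  %.0f MB\n", input_bytes / (1024.0 * 1024.0));  /* ~7324 MB */
        printf("sort temp:    %.0f MB\n", sort_temp_kb / 1024.0);            /* ~3655 MB */
        printf("hashagg temp: %.0f MB\n", hash_temp_kb / 1024.0);            /* ~6643 MB */
        return 0;
    }

So the sort's temp files amount to only about half of the estimated
input_bytes, which lines up with the iosnoop numbers, while the hashagg
temp files are close to what iosnoop reports for it.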

regards

[1] https://www.postgresql.org/message-id/20200724012248.y77rpqc73agrsvb3@development

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
