Re: Default setting for enable_hashagg_disk

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Davis <pgsql(at)j-davis(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Andres Freund <andres(at)anarazel(dot)de>, Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Default setting for enable_hashagg_disk
Date: 2020-07-24 08:40:47
Message-ID: 20200724084047.hnhigkgagzifuitd@development
Lists: pgsql-docs pgsql-hackers

On Thu, Jul 23, 2020 at 07:33:45PM -0700, Peter Geoghegan wrote:
>On Thu, Jul 23, 2020 at 6:22 PM Tomas Vondra
><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> So let me share some fresh I/O statistics collected on the current code
>> using iosnoop. I've done the tests on two different machines using the
>> "aggregate part" of TPC-H Q17, i.e. essentially this:
>>
>> SELECT * FROM (
>>     SELECT
>>         l_partkey AS agg_partkey,
>>         0.2 * avg(l_quantity) AS avg_quantity
>>     FROM lineitem
>>     GROUP BY l_partkey
>>     OFFSET 1000000000
>> ) part_agg;
>>
>> The OFFSET is there just to ensure we don't need to send anything to
>> the client, etc.
>
>Thanks for testing this.
>
>> So sort writes ~3.4GB of data, give or take. But hashagg/master writes
>> almost 6-7GB of data, i.e. almost twice as much. Meanwhile, with the
>> original CP_SMALL_TLIST we'd write "only" ~5GB of data. That's still
>> much more than the 3.4GB of data written by sort (which has to spill
>> everything, while hashagg only spills rows not covered by the groups
>> that fit into work_mem).
>
>What I find when I run your query (with my own TPC-H DB that is
>smaller than what you used here -- 59,986,052 lineitem tuples) is that
>the sort required about 7x more memory than the hash agg to do
>everything in memory: 4,384,711KB for the quicksort vs 630,801KB peak
>hash agg memory usage. I'd be surprised if the ratio was very
>different for you -- but can you check?
>

I can check, but it's not quite clear to me what we're looking for.
Increase work_mem until there's no need to spill in either case?

>I think that there is something pathological about this spill
>behavior, because it sounds like the precise opposite of what you
>might expect when you make a rough extrapolation of what disk I/O will
>be based on the memory used in no-spill cases (as reported by EXPLAIN
>ANALYZE).
>

Maybe; I'm not sure what exactly you think is pathological. The trouble
is that hashagg has to spill input tuples, while the memory used in the
no-spill case represents aggregated groups, so I'm not sure how you
could extrapolate from one to the other ...
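To make that concrete, here's a rough back-of-the-envelope sketch
(illustrative Python, not PostgreSQL code; the row counts, row size and
group counts are made-up assumptions, and it ignores skew and recursive
spilling): sort has to write every input row, while hashagg only writes
the rows whose groups didn't fit into work_mem.

```python
def sort_spill_bytes(n_input_rows, row_bytes):
    # An external sort writes (roughly) the whole input at least once.
    return n_input_rows * row_bytes

def hashagg_spill_bytes(n_input_rows, n_groups, groups_in_memory,
                        row_bytes):
    # Assume input rows are spread evenly across groups (real data is
    # skewed); rows whose group is resident are absorbed into the
    # in-memory hash table, the rest are written out.
    resident_fraction = min(groups_in_memory, n_groups) / n_groups
    return int(n_input_rows * (1 - resident_fraction) * row_bytes)

# e.g. 60M rows of ~60 bytes, 7.5M groups, 1M groups fit in work_mem
print(sort_spill_bytes(60_000_000, 60))     # ~3.6 GB
print(hashagg_spill_bytes(60_000_000, 7_500_000,
                          1_000_000, 60))   # ~3.1 GB for one pass
```

The point being that the no-spill memory usage (size of the aggregated
groups) doesn't directly appear in either formula, so extrapolating the
spill volume from it seems bound to mislead.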

FWIW one more suspicious thing that I forgot to mention is the behavior
of the "planned partitions" depending on work_mem, which looks like
this:

   work_mem    Planned Partitions    HashAgg Batches
   2MB                         64               4160
   4MB                        128              16512
   8MB                        256              21488
   64MB                        32               2720
   256MB                        8                  8

I'd expect the number of planned partitions to decrease (slowly) as
work_mem increases, but it seems to increase initially. Seems a bit
strange, but maybe it's expected.
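For comparison, here's my naive guess at the shape of such a heuristic
(illustrative Python only, not the actual nodeAgg.c logic; the clamp
bounds and the 6GB input size are assumptions): size each partition to
later fit in work_mem, round up to a power of two, and clamp.

```python
def planned_partitions(input_bytes, work_mem_bytes,
                       min_parts=4, max_parts=1024):
    # We want each partition to be re-aggregatable within work_mem,
    # so aim for roughly input_bytes / work_mem partitions ...
    need = max(1, input_bytes // work_mem_bytes)
    # ... rounded up to a power of two ...
    parts = 1
    while parts < need:
        parts *= 2
    # ... and clamped to some sane range.
    return max(min_parts, min(parts, max_parts))

for wm_mb in (2, 4, 8, 64, 256):
    print(wm_mb, planned_partitions(6 * 1024**3, wm_mb * 1024**2))
```

A heuristic of this shape is monotonically non-increasing in work_mem,
which is why the initial *increase* in the observed numbers looks odd
to me.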

>> What I find really surprising is the costing - despite writing about
>> twice as much data, the hashagg cost is estimated to be much lower than
>> the sort. For example on the i5 machine, the hashagg cost is ~10M, while
>> sort cost is almost 42M. Despite using almost twice as much disk. And
>> the costing is exactly the same for master and the CP_SMALL_TLIST.
>
>That does make it sound like the costs of the hash agg aren't being
>represented. I suppose it isn't clear if this is a costing issue
>because it isn't clear if the execution time performance itself is
>pathological or is instead something that must be accepted as the cost
>of spilling the hash agg in a general kind of way.
>

Not sure, but I think we need to spill roughly as much data as sort, so
it seems a bit strange that we're spilling 2x as much data and yet the
estimated cost is so much lower.
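As a crude sanity check (assumed numbers, not the real cost model): if
the spill cost were charged simply per byte written and read back,
writing ~2x the data should come out ~2x more expensive, not ~4x
cheaper as in the 10M vs 42M estimates above.

```python
seq_page_cost = 1.0   # planner default
page_bytes = 8192

def spill_io_cost(bytes_spilled):
    # Charge each spilled page once for the write and once for the
    # read-back; deliberately ignores CPU costs, comparisons, etc.
    pages = bytes_spilled / page_bytes
    return 2 * pages * seq_page_cost

sort_cost = spill_io_cost(3.4e9)   # ~3.4 GB spilled by sort
hash_cost = spill_io_cost(6.5e9)   # ~6.5 GB spilled by hashagg
print(hash_cost / sort_cost)       # ~1.9, i.e. hashagg should be pricier
```

So whatever the cost model charges for hashagg spilling, it doesn't
seem to track the actual I/O volume here.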

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
