Re: Default setting for enable_hashagg_disk

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Davis <pgsql(at)j-davis(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Andres Freund <andres(at)anarazel(dot)de>, Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Default setting for enable_hashagg_disk
Date: 2020-07-24 13:18:52
Message-ID: 20200724131852.nul7wihayw4uadp7@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-docs pgsql-hackers

On Fri, Jul 24, 2020 at 10:40:47AM +0200, Tomas Vondra wrote:
>On Thu, Jul 23, 2020 at 07:33:45PM -0700, Peter Geoghegan wrote:
>>On Thu, Jul 23, 2020 at 6:22 PM Tomas Vondra
>><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>>So let me share some fresh I/O statistics collected on the current code
>>>using iosnoop. I've done the tests on two different machines using the
>>>"aggregate part" of TPC-H Q17, i.e. essentially this:
>>>
>>> SELECT * FROM (
>>> SELECT
>>> l_partkey AS agg_partkey,
>>> 0.2 * avg(l_quantity) AS avg_quantity
>>> FROM lineitem GROUP BY l_partkey OFFSET 1000000000
>>> ) part_agg;
>>>
>>>The OFFSET is there just to ensure we don't need to send anything to
>>>the client, etc.
>>
>>Thanks for testing this.
>>
>>>So sort writes ~3.4GB of data, give or take. But hashagg/master writes
>>>almost 6-7GB of data, i.e. almost twice as much. Meanwhile, with the
>>>original CP_SMALL_TLIST we'd write "only" ~5GB of data. That's still
>>>much more than the 3.4GB of data written by sort (which has to spill
>>>everything, while hashagg only spills rows not covered by the groups
>>>that fit into work_mem).
>>
>>What I find when I run your query (with my own TPC-H DB that is
>>smaller than what you used here -- 59,986,052 lineitem tuples) is that
>>the sort required about 7x more memory than the hash agg to do
>>everything in memory: 4,384,711KB for the quicksort vs 630,801KB peak
>>hash agg memory usage. I'd be surprised if the ratio was very
>>different for you -- but can you check?
>>
>
>I can check, but it's not quite clear to me what are we looking for?
>Increase work_mem until there's no need to spill in either case?
>

FWIW the hashagg needs about 4775953kB and the sort 33677586kB. So yeah,
that's about 7x more. I think that's probably built into the TPC-H data
set. It'd be easy to construct cases with much higher/lower factors.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
plans.txt text/plain 1.8 KB

In response to

Browse pgsql-docs by date

  From Date Subject
Next Message Robert Haas 2020-07-24 15:18:48 Re: Default setting for enable_hashagg_disk
Previous Message Tomas Vondra 2020-07-24 08:40:47 Re: Default setting for enable_hashagg_disk

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2020-07-24 13:23:15 Re: renaming configure.in to configure.ac
Previous Message Amul Sul 2020-07-24 13:09:14 Re: [Patch] ALTER SYSTEM READ ONLY