Re: Default setting for enable_hashagg_disk

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, David Rowley <dgrowleyml(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Default setting for enable_hashagg_disk
Date: 2020-06-29 21:33:19
Message-ID: 20200629213319.ldfbi6wiyh6idf6e@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-docs pgsql-hackers

On Mon, Jun 29, 2020 at 01:31:40PM -0400, Bruce Momjian wrote:
>On Mon, Jun 29, 2020 at 10:20:14AM -0700, Peter Geoghegan wrote:
>> I have no reason to believe that the planner is any more or any less
>> likely to conclude that the hash table will fit in memory in v13 as
>> things stand (I don't know if the BufFile issue matters).
>>
>> In general, grouping estimates probably aren't very good compared to
>> join estimates. I imagine that in either v12 or v13 the planner is
>> likely to incorrectly believe that it'll all fit in memory fairly
>> often. v12 was much too permissive about what could happen. But v13 is
>> too conservative.
>
>FYI, we have improved planner statistics estimates for years, which must
>have affected node spill behavior on many node types (except hash_agg),
>and don't remember any complaints about it.
>

I think misestimates for GROUP BY are quite common and very hard to fix.
Firstly, our ndistinct estimator may give pretty bad results depending
e.g. on how is the table correlated.

I've been running some TPC-H benchmarks, and for partsupp.ps_partkey our
estimate was 4338776, when the actual value is 15000000, i.e. ~3.5x
higher. This was with statistics target increased to 1000. I can easily
imagine even worse estimates with lower values.

This ndistinct estimator is used even for extended statistics, so that
can't quite save us. Moreover, the grouping may be on top of a join, in
which case using ndistinct coefficients may not be possible :-(

So I think this is a quite real problem ...

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Browse pgsql-docs by date

  From Date Subject
Next Message Peter Geoghegan 2020-06-29 21:46:38 Re: Default setting for enable_hashagg_disk
Previous Message Tomas Vondra 2020-06-29 21:22:29 Re: Default setting for enable_hashagg_disk

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2020-06-29 21:46:38 Re: Default setting for enable_hashagg_disk
Previous Message Tomas Vondra 2020-06-29 21:22:29 Re: Default setting for enable_hashagg_disk