Re: BUG #4264: Optimizer fails to use hash_aggregate when appropriate.

From: "Scott Carey" <scott(at)richrelevance(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #4264: Optimizer fails to use hash_aggregate when appropriate.
Date: 2008-06-26 02:40:23
Message-ID: a1ec7d000806251940h3eedc28en92f7d3118c6163d2@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Tom,
Thank you for the quick reply. This information was very useful.

I apologize as I was not aware of the internal limitations of the hash
aggregate implementation and the DISTINCT keyword. I also seem to have
mischaracterized this issue in an attempt to reduce the problem to one less
specific than my original encounter. One half of this may yet be a bug (or
enhancement request).

First, the original issue that got me on this track. I'll keep it to the
point since it is likely already known (though I could not find it by
searching archives).
2. The query planner does not estimate the number of returned rows
appropriately when table inheritance / partitioning is involved, leading to
poor choices for aggregation. Here are two explains that demonstrate this.
The parent table has 0 rows, estimated as 1 row. Ideally it should not
affect the query plan.

rr=> explain SELECT g_id, p_type, strat from log.creative_display_logs
WHERE date='2008-06-15' and s_id=12 GROUP BY g_id, p_type, strat;
QUERY
PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Group (cost=6488615.49..6747134.82 rows=2585194 width=130)
-> Sort (cost=6488615.49..6553245.32 rows=25851933 width=130)
Sort Key: log.creative_display_logs.g_id,
log.creative_display_logs.p_type, log.creative_display_logs.strat
-> Append (cost=0.00..1450170.88 rows=25851933 width=130)
-> Seq Scan on creative_display_logs (cost=0.00..10.90
rows=1 width=130)
Filter: ((date = '2008-06-15'::date) AND (s_id = 12))
-> Seq Scan on creative_display_logs_12_2008_6_15
creative_display_logs (cost=0.00..1450159.98 rows=25851932 width=28)
Filter: ((date = '2008-06-15'::date) AND (s_id = 12))
(8 rows)

rr=> explain SELECT g_id, p_type, strat from
p_log.creative_display_logs_12_2008_6_15 GROUP BY g_id, p_type, strat;
QUERY
PLAN
------------------------------------------------------------------------------------------------------
HashAggregate (cost=1514789.81..1514801.57 rows=1176 width=28)
-> Seq Scan on creative_display_logs_12_2008_6_15
(cost=0.00..1320900.32 rows=25851932 width=28)
(2 rows)

As you can see, something about the inheritance changes the expected # of
rows dramatically and affects the plan inappropriately. The addition of one
scan that returns 0 or 1 row cannot increase the actual output by a couple
million.

Next:

Thanks to your insight, ran some tests and demonstrated the vast difference
in query plans resulting from the two queries (on a large table , on a
column with few values) of the form:
SELECT column from TABLE GROUP BY column; (uses hash_aggregate)
SELECT DISTINCT column from TABLE; (full sort, then aggregate)

Thus, a query of the form:
SELECT count(distinct v_guid) as v_count, p_type FROM table GROUP BY
p_type;

can be rewritten as the below to work around the query planner limitation
and persuade a more efficient query. (Thanks again for highlighting exactly
where this limitation is).

explain SELECT count(temp.v_guid) as v_count, temp.p_type from (SELECT
v_guid, p_type FROM table GROUP BY v_guid, p_type) as temp group by p_type;
QUERY
PLAN
------------------------------------------------------------------------------------------------------------
HashAggregate (cost=1451905.60..1451908.10 rows=200 width=520)
-> HashAggregate (cost=1450159.98..1450858.23 rows=69825 width=50)
-> Seq Scan on creative_display_logs_12_2008_6_15
(cost=0.00..1320900.32 rows=25851932 width=50)

This obfuscates the original intention of the query and is not an ideal
solution, but given the limitations of the planner, and 20% the query
runtime, it is a pragmatic necessity.
If, the sort implied in the distinct was required, it could be applied
separately and also significantly improve the query performance.

>DISTINCT forces a sort, so there wouldn't be any advantage.

I contend that this is false. Example: Hash_aggregate from 25 million rows
to 10000 unique rows, then sort the result. This is faster than sorting 25
million records then group aggregating by about a factor of 5 on the
hardware I am using. In fact, it is faster even if the aggregate only
reduces the intermediate results by about a factor of 10 (for 25 million
rows of the size I'm using -- more rows makes it even more favorable).

I contend that
SELECT DISTINCT column FROM table;
and
SELECT column FROM table GROUP BY column ORDER BY column;

Should consider such plans. Currently, these both fail to produce optimal
queries on large datasets when the aggregation reduces the result size
enough.
Additionally, the implied sort can be ignored if the columns in a distinct
are aggregated, as in the first example, since aggregates are input-order
insensitive.

Lastly, a more sophisticated hash-aggregate could be multi-tiered and apply
to my first example, hashing by the the group by clause and then sub-hashing
the column distinct values. No intermediate table would be required in this
case, unlike my query butchering above. This combined with the other query
plan options would allow for clear, natural - looking queries without hints
or hacking up the query into more ugly and less flexible forms as I am
required to above (both are mere work-arounds, but work-arounds are
necessary stopgaps).

Thanks for the clarification on the limitations of the query planner Tom,
and the reminder that DISTINCT forces a sort -- I forgot all about that.

-Scott Carey

On Wed, Jun 25, 2008 at 1:55 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> "Scott Carey" <scott(at)richrelevance(dot)com> writes:
> > The query optimizer fails to use a hash aggregate most of the time. This
> is
> > an inconsistent behavior.
>
> Hash aggregation is not used when a DISTINCT aggregate is specified.
> DISTINCT forces a sort, so there wouldn't be any advantage.
>
> > Of course DB hints would solve this.
>
> No, they wouldn't.
>
> regards, tom lane
>

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Clemens Fischer 2008-06-26 06:42:52 BUG #4265: PostgreSQL fails to compile
Previous Message Bruce Momjian 2008-06-26 01:12:09 Re: [BUGS] BUG #4128: The postmaster.opts.default file is begin ignored