Re: Improving the accuracy of estimate_num_groups()

From: Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Improving the accuracy of estimate_num_groups()
Date: 2010-01-27 14:32:08
Message-ID: e08cc0401001270632j1b389e67w70a16ec3ae1a4333@mail.gmail.com
Lists: pgsql-hackers

2010/1/23 Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>:
> I was just poking at the test case provided by Allen Johnson in bug
> #5294.  The essence of the complaint is that the planner is choosing
> a sort-and-GroupAggregate plan when a HashAggregate-and-then-sort
> plan would be faster, because the aggregation steps are roughly the
> same speed either way while post-aggregation sorting is a lot faster
> because it has fewer rows to process.  The reason the planner makes
> the wrong choice is that it doesn't think the aggregation will reduce
> the number of rows.  And that can be blamed on estimate_num_groups(),
> which is a largely heuristic affair anyway but seems to fall down
> particularly badly on the number of groups in a grouped join query.
>
> In a join situation, what estimate_num_groups() does is mostly to
> compute the product of its estimates of the number of groups in each
> input relation.  Of course that's often a huge overestimate.  This is
> masked to some extent by clamping the result to be at most the estimate
> of the unaggregated join size, which is why we get exactly the same
> pre-aggregation and post-aggregation rowcount estimates in Allen's
> example.  But we need to do better if we're to have any hope of making
> intelligent choices about this.
>
> The only bit of intelligence estimate_num_groups() adds for join cases
> is that it throws away any grouping variables that have been found to be
> equal to other grouping variables; that is, given
>        select ... from ... where a.x = b.y group by a.x, b.y
> the estimate will be the smaller of the number of x or y values
> rather than their product.  However, that doesn't help in the least
> for Allen's example, because only one of each pair of join keys
> appears among the grouping columns.
>
> For cases involving equated grouping columns in the same relation,
> we don't use that heuristic anyway; what we do is compute the product of
> the number of values and then reduce that by the estimated selectivity
> of the available restriction clauses.  That seems to work reasonably
> well, or at least better than what is happening at the join level.
> So it strikes me that maybe we should delete the drop-equal-variables
> heuristic altogether (basically, reduce add_unique_group_var() to just
> lappend) and then multiply the ending number-of-groups estimate by the
> selectivity of the join clauses.  In this way we take some account of
> join clauses that aren't equating one grouping column to another,
> whereas right now they're completely ignored.
>
> Comments?
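
To make sure I'm reading the scenario right, the problematic shape I
have in mind is roughly this (a made-up example, not Allen's actual
query from bug #5294):

    select o.id, i.product_id, sum(i.amount)
      from orders o
      join order_items i on i.order_id = o.id
     group by o.id, i.product_id;

Only one side of the join pair appears among the grouping columns
(o.id but not i.order_id), so the drop-equal-variables heuristic never
applies, and the product of per-relation distinct counts blows past
the join size and gets clamped to it; that is how the pre- and
post-aggregation row estimates end up identical.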

+1 for deleting the heuristic part. In most cases this kind of
misestimate shows up when novices have designed a poor schema; usually
they cannot spot the problem they created themselves, and just say
"hey, PostgreSQL is slow."

At the least, it seems to me we should get rid of the possibility that
the estimated row counts before and after aggregation come out exactly
the same, even though in some cases an aggregate really *can* produce
that many rows. In general, aggregation reduces the number of rows.

However, I don't see why the reported queries produce different plans.
Is the "NORMAL QUERY" the join case? To me, the "HACK QUERY" also
looks like a join case, not a same-relation case.
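
To spell out the distinction I mean (made-up fragments in the style of
the example upthread), I read

    select ... from a, b where a.x = b.y group by a.x, b.y

as the join case, and

    select ... from a where a.x = a.y group by a.x, a.y

as the same-relation case; both reported queries look to me like the
former.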

Regards,

--
Hitoshi Harada
