Improving the accuracy of estimate_num_groups()

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Improving the accuracy of estimate_num_groups()
Date: 2010-01-23 04:19:39
Message-ID: 28948.1264220379@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I was just poking at the test case provided by Allen Johnson in bug
#5294. The essence of the complaint is that the planner is choosing
a sort-and-GroupAggregate plan when a HashAggregate-and-then-sort
plan would be faster, because the aggregation steps are roughly the
same speed either way while post-aggregation sorting is a lot faster
because it has fewer rows to process. The reason the planner makes
the wrong choice is that it doesn't think the aggregation will reduce
the number of rows. And that can be blamed on estimate_num_groups(),
which is a largely heuristic affair anyway but seems to fall down
particularly badly on the number of groups in a grouped join query.

In a join situation, what estimate_num_groups() does is mostly to
compute the product of its estimates of the number of groups in each
input relation. Of course that's often a huge overestimate. This is
masked to some extent by clamping the result to be at most the estimate
of the unaggregated join size, which is why we get exactly the same
pre-aggregation and post-aggregation rowcount estimates in Allen's
example. But we need to do better if we're to have any hope of making
intelligent choices about this.

The only bit of intelligence estimate_num_groups() adds for join cases
is that it throws away any grouping variables that have been found to be
equal to other grouping variables; that is, given
select ... from ... where a.x = b.y group by a.x, b.y
the estimate will be the smaller of the number of x or y values
rather than their product. However, that doesn't help in the least
for Allen's example, because only one of each pair of join keys
appears among the grouping columns.

For cases involving equated grouping columns in the same relation,
we don't use that heuristic anyway; what we do is compute the product of
the number of values and then reduce that by the estimated selectivity
of the available restriction clauses. That seems to work reasonably
well, or at least better than what is happening at the join level.
So it strikes me that maybe we should delete the drop-equal-variables
heuristic altogether (basically, reduce add_unique_group_var() to just
lappend) and then multiply the ending number-of-groups estimate by the
selectivity of the join clauses. In this way we take some account of
join clauses that aren't equating one grouping column to another,
whereas right now they're completely ignored.

Comments?

regards, tom lane

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2010-01-23 04:44:11 Re: 8.5 vs. 9.0, Postgres vs. PostgreSQL
Previous Message Alex Hunsaker 2010-01-23 03:59:10 Re: Miscellaneous changes to plperl [PATCH]