Group-count estimation statistics

From: Tom Lane <tgl@sss.pgh.pa.us>
To: pgsql-hackers@postgreSQL.org
Subject: Group-count estimation statistics
Date: 2005-01-28 15:53:33
Message-ID: 1172.1106927613@sss.pgh.pa.us
Lists: pgsql-hackers

I got a complaint from a fellow Red Hatter that PG 8.0 is way slower
than 7.4 on some statistical analysis tasks he was doing. Investigation
showed that the primary problem was selection of Sort/GroupAgg in place
of HashAgg to compute some grouped aggregates. Essentially he was doing

select sum(), avg() ... from temp_table
group by customer_id, item_id, month, year;

where the temp table had just been constructed and hadn't been analyzed
in any way. (Of course I told him "so analyze" ... but it didn't help.)

In 7.4, the utter lack of any information about the source table caused
the planner to assume only 1000 rows, so of course the estimated hash
table size was plenty small enough to fit in sort_mem. In 8.0, the
planner looks at the true physical size of the table, and even without
any stats comes out with a rowcount estimate tolerably close to the
actual value (about 10M rows). With or without stats it then estimates
that there will also be about 10M groups, under which conditions even
raising sort_mem to the max won't persuade it to hash-aggregate.
(However the actual number of groups is about 0.2M, and hash aggregation
is really a good bit faster than sorting.)
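
For anyone who wants to see which plan gets chosen, a minimal check looks
something like this (the aggregate arguments qty and price are invented
for the illustration, and sort_mem here is just the pre-8.0 spelling of
work_mem):

SET sort_mem = 65536;   -- i.e. work_mem in 8.0; raise it and re-plan
EXPLAIN
SELECT customer_id, item_id, month, year, sum(qty), avg(price)
FROM temp_table
GROUP BY customer_id, item_id, month, year;

The top of the plan comes out as either a HashAggregate or a
GroupAggregate fed by a Sort, depending on whether the planner thinks the
hash table for its estimated group count fits in that memory budget.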

The reason this happens even with stats is that the algorithm for
estimating the number of groups in a multi-group-column situation
is basically "take the product of the number of distinct values of
each grouping column, but clamp to the number of rows estimated for
the whole table". It turns out that customer_id and item_id are
pretty highly correlated, enough that the actual number of distinct
groups is barely a hundredth of what you get from the product rule.
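
To spell that rule out, here is a rough by-hand version of the estimate,
using the table and column names from the query above (this assumes the
table has been ANALYZEd, and it glosses over some refinements in the real
code; a negative n_distinct in pg_stats means a fraction of the row count,
hence the CASE):

SELECT least(
         (SELECT reltuples FROM pg_class WHERE relname = 'temp_table'),
         exp(sum(ln(CASE WHEN n_distinct < 0
                         THEN -n_distinct * (SELECT reltuples FROM pg_class
                                             WHERE relname = 'temp_table')
                         ELSE n_distinct END)))
       ) AS product_rule_estimate
FROM pg_stats
WHERE tablename = 'temp_table'
  AND attname IN ('customer_id', 'item_id', 'month', 'year');

For the table in question the product term shoots well past the ~10M row
count, so it is the clamp that produces the ~10M-group estimate.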

The only real solution, of course, is to acquire cross-column
statistics, but I don't see that happening in the near future.

As a short-term hack, I am thinking that the "clamp to size of table"
part of the rule is overly pessimistic, and that we should consider
something like "clamp to size of table / 10" instead. The justification
for this is the thought that you aren't going to bother grouping unless
it actually reduces the data volume. We have used similar rules in the
past --- for example, before the logic for trying to estimate actual
group counts was put in, the estimate for the number of output rows
from an Agg or Group node was just the number of input rows over 10.
Essentially I think that that rule wasn't too bad, and that we should
allow the statistics-based estimation code to reduce that estimate but
not increase it --- at least not when there's more than one variable
involved. I have some faith in the stats-based approach for a single
column but hardly any for multiple columns, because of the correlation
issue.
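
In arithmetic terms the proposal is just
estimate = min(product-rule estimate, input rows / 10), which for the
round numbers in this example works out to:

SELECT least(10000000.0, 10000000.0 / 10) AS clamped_estimate;
-- 1000000, versus ~10M from the current rule and ~0.2M actual groups

Still an overestimate, but a far less damaging one.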

Comments?

regards, tom lane
