## Group-count estimation statistics

I got a complaint from a fellow Red Hatter that PG 8.0 is way slower than 7.4 on some statistical analysis tasks he was doing. Investigation showed that the primary problem was selection of Sort/GroupAgg in place of HashAgg to compute some grouped aggregates. Essentially he was doing

    select sum(), avg() ... from temp_table
    group by customer_id, item_id, month, year;

where the temp table had just been constructed and hadn't been analyzed in any way. (Of course I told him "so analyze" ... but it didn't help.)

In 7.4, the utter lack of any information about the source table caused the planner to assume only 1000 rows, so of course the estimated hash table size was plenty small enough to fit in sort_mem. In 8.0, the planner looks at the true physical size of the table, and even without any stats comes out with a rowcount estimate tolerably close to the actual value (about 10M rows). With or without stats, it then estimates that there will also be about 10M groups, under which conditions even raising sort_mem to the max won't persuade it to hash-aggregate. (However, the actual number of groups is about 0.2M, and hash aggregation is really a good bit faster than sorting.)

The reason this happens even with stats is that the algorithm for estimating the number of groups in a multi-group-column situation is basically "take the product of the number of distinct values of each grouping column, but clamp to the number of rows estimated for the whole table". It turns out that customer_id and item_id are pretty highly correlated, enough that the actual number of distinct groups is barely a hundredth of what you get from the product rule.

The only real solution, of course, is to acquire cross-column statistics, but I don't see that happening in the near future. As a short-term hack, I am thinking that the "clamp to size of table" part of the rule is overly pessimistic, and that we should consider something like "clamp to size of table / 10" instead. The justification for this is the thought that you aren't going to bother grouping unless it actually reduces the data volume. We have used similar rules in the past --- for example, before the logic for trying to estimate actual group counts was put in, the estimate for the number of output rows from an Agg or Group node was just the number of input rows over 10.

Essentially I think that that rule wasn't too bad, and that we should allow the statistics-based estimation code to reduce that estimate but not increase it --- at least not when there's more than one variable involved. I have some faith in the stats-based approach for a single column, but hardly any for multiple columns, because of the correlation issue.

Comments?

regards, tom lane
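To make the arithmetic concrete, here is a minimal sketch of the estimation rule under discussion. It is not the actual PostgreSQL code (the real logic lives in estimate_num_groups() in selfuncs.c); the per-column distinct counts and the function names are hypothetical, chosen so the product rule overshoots in the way the example above describes.

```c
#include <stdio.h>

/*
 * Toy model of the multi-column group-count estimate discussed above.
 * Not the real estimate_num_groups(); it only illustrates the product
 * rule plus the two clamping policies being compared.
 */
static double
estimate_groups(const double *ndistinct, int ncols, double ntuples,
                int use_div10_clamp)
{
    double      estimate = 1.0;
    double      clamp;
    int         i;

    /* Product rule: multiply the per-column distinct-value counts. */
    for (i = 0; i < ncols; i++)
        estimate *= ndistinct[i];

    /*
     * The current rule clamps to the estimated table size; the proposed
     * hack clamps to a tenth of it, on the theory that nobody bothers
     * grouping unless it actually reduces the data volume.
     */
    clamp = use_div10_clamp ? ntuples / 10.0 : ntuples;
    if (estimate > clamp)
        estimate = clamp;
    return estimate;
}

int
main(void)
{
    /* Hypothetical ndistinct for customer_id, item_id, month, year. */
    double      nd[4] = {50000, 20000, 12, 5};
    double      ntuples = 10e6;     /* ~10M-row table */

    /*
     * The raw product is 6e10, so both clamps kick in: 10M groups under
     * the current rule, 1M under the proposed one.  The actual group
     * count in the scenario above is ~0.2M, because customer_id and
     * item_id are highly correlated.
     */
    printf("current rule:  %.0f\n", estimate_groups(nd, 4, ntuples, 0));
    printf("proposed rule: %.0f\n", estimate_groups(nd, 4, ntuples, 1));
    return 0;
}
```

Under the proposed rule the estimate lands at 1M instead of 10M: still a factor of five above the actual 0.2M groups, but plausibly low enough for the hash table to look like it fits in sort_mem.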

### Responses

- Re: Group-count estimation statistics at 2005-01-28 18:25:40 from Stephen Frost
- Re: Group-count estimation statistics at 2005-01-28 18:29:10 from Greg Stark
- Re: Group-count estimation statistics at 2005-01-28 23:42:09 from Sailesh Krishnamurthy
- Re: Group-count estimation statistics at 2005-01-31 11:08:41 from Manfred Koizar
