Skip site navigation (1)
Skip section navigation (2)
## Re: Bad n_distinct estimation; hacks suggested?

### In response to

### Responses

### pgsql-performance by date

### pgsql-hackers by date

Josh Berkus <josh(at)agliodbs(dot)com> writes: > ... I think the problem is in our heuristic sampling code. I'm not the first > person to have this kind of a problem. Will be following up with tests ... I looked into this a while back when we were talking about changing the sampling method. The conclusions were discouraging. Fundamentally, using constant sized samples of data for n_distinct is bogus. Constant sized samples only work for things like the histograms that can be analyzed through standard statistics population sampling which depends on the law of large numbers. n_distinct requires you to estimate how frequently very rare things occur. You can't apply the law of large numbers because even a single instance of a value out of a large pool alters the results disproportionately. To get a valid estimate for n_distinct you would need to sample a fixed percentage of the table. Not a fixed size sample. That just isn't practical. Moreover, I think the percentage would have to be quite large. Even if you sampled half the table I think the confidence interval would be quite wide. The situation is worsened because it's unclear how to interpolate values for subsets of the table. If the histogram says you have a million records and you're adding a clause that has a selectivity of 50% then half a million is a good guess. But if what you care about is n_distinct and you start with a million records with 1,000 distinct values and then apply a clause that filters 50% of them, how do you estimate the resulting n_distinct? 500 is clearly wrong, but 1,000 is wrong too. You could end up with anywhere from 0 to 1,000 and you have no good way to figure out where the truth lies. So I fear this is fundamentally a hopeless situation. It's going to be difficult to consistently get good plans for any queries that depend on good estimates for n_distinct. -- greg

- Re: Bad n_distinct estimation; hacks suggested? at 2005-04-22 20:36:08 from Josh Berkus

- Re: Bad n_distinct estimation; hacks suggested? at 2005-04-23 23:39:11 from Josh Berkus

Next: From:Joel FradkinDate:2005-04-23 14:11:28Subject: Re: Joel's Performance Issues WAS : Opteron vs XeonPrevious: From: Kevin BrownDate: 2005-04-23 06:30:45Subject: Re: Joel's Performance Issues WAS : Opteron vs Xeon

Next: From:Hannu KrosingDate:2005-04-23 09:17:26Subject: Re: possible TODO: read-only tables, select from indexesPrevious: From: Kevin BrownDate: 2005-04-23 06:28:33Subject: Re: Postgres: pg_hba.conf, md5, pg_shadow, encrypted passwords