Skip site navigation (1)
Skip section navigation (2)
## Re: [HACKERS] Bad n_distinct estimation; hacks suggested?

### In response to

### Responses

### pgsql-performance by date

### pgsql-hackers by date

Tom Lane wrote: >Josh Berkus <josh(at)agliodbs(dot)com> writes: > > >>Overall, our formula is inherently conservative of n_distinct. That is, I >>believe that it is actually computing the *smallest* number of distinct >>values which would reasonably produce the given sample, rather than the >>*median* one. This is contrary to the notes in analyze.c, which seem to >>think that we're *overestimating* n_distinct. >> >> > >Well, the notes are there because the early tests I ran on that formula >did show it overestimating n_distinct more often than not. Greg is >correct that this is inherently a hard problem :-( > >I have nothing against adopting a different formula, if you can find >something with a comparable amount of math behind it ... but I fear >it'd only shift the failure cases around. > > > > The math in the paper does not seem to look at very low levels of q (= sample to pop ratio). The formula has a range of [d,N]. It appears intuitively (i.e. I have not done any analysis) that at very low levels of q, as f1 moves down from n, the formula moves down from N towards d very rapidly. I did a test based on the l_comments field in a TPC lineitems table. The test set has N = 6001215, D = 2921877. In my random sample of 1000 I got d = 976 and f1 = 961, for a DUJ1 figure of 24923, which is too low by 2 orders of magnitude. I wonder if this paper has anything that might help: http://www.stat.washington.edu/www/research/reports/1999/tr355.ps - if I were more of a statistician I might be able to answer :-) cheers andrew

- Re: [HACKERS] Bad n_distinct estimation; hacks suggested? at 2005-04-24 04:48:59 from Tom Lane

- Re: [HACKERS] Bad n_distinct estimation; hacks suggested? at 2005-04-24 18:30:50 from Josh Berkus
- Re: [HACKERS] Bad n_distinct estimation; hacks suggested? at 2005-04-24 19:08:15 from Josh Berkus

Next: From:Josh BerkusDate:2005-04-24 18:30:50Subject: Re: [HACKERS] Bad n_distinct estimation; hacks suggested?Previous: From: Marko RistolaDate: 2005-04-24 17:09:15Subject: Re: [HACKERS] Bad n_distinct estimation; hacks suggested?

Next: From:Joshua D. DrakeDate:2005-04-24 18:01:28Subject: Re: Constant WAL replayPrevious: From: Alvaro HerreraDate: 2005-04-24 17:20:39Subject: Re: Constant WAL replay