Skip site navigation (1) Skip section navigation (2)

Re: [HACKERS] Bad n_distinct estimation; hacks suggested?

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: josh(at)agliodbs(dot)com, Greg Stark <gsstark(at)mit(dot)edu>,Marko Ristola <marko(dot)ristola(at)kolumbus(dot)fi>,pgsql-perform <pgsql-performance(at)postgresql(dot)org>,pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] Bad n_distinct estimation; hacks suggested?
Date: 2005-04-24 17:58:10
Message-ID: 426BDE32.4070004@dunslane.net (view raw or flat)
Thread:
Lists: pgsql-hackerspgsql-performance

Tom Lane wrote:

>Josh Berkus <josh(at)agliodbs(dot)com> writes:
>  
>
>>Overall, our formula is inherently conservative of n_distinct.   That is, I 
>>believe that it is actually computing the *smallest* number of distinct 
>>values which would reasonably produce the given sample, rather than the 
>>*median* one.  This is contrary to the notes in analyze.c, which seem to 
>>think that we're *overestimating* n_distinct.  
>>    
>>
>
>Well, the notes are there because the early tests I ran on that formula
>did show it overestimating n_distinct more often than not.  Greg is
>correct that this is inherently a hard problem :-(
>
>I have nothing against adopting a different formula, if you can find
>something with a comparable amount of math behind it ... but I fear
>it'd only shift the failure cases around.
>
>
>  
>

The math in the paper does not seem to look at very low levels of q (= 
sample to pop ratio).

The formula has a range of [d,N]. It appears intuitively (i.e. I have 
not done any analysis) that at very low levels of q, as f1 moves down 
from n, the formula moves down from N towards d very rapidly. I did a 
test based on the l_comments field in a TPC lineitems table. The test 
set has N = 6001215, D =  2921877. In my random sample of 1000 I got d = 
976 and f1 = 961, for a DUJ1 figure of 24923, which is too low by 2 
orders of magnitude.

I wonder if this paper has anything that might help: 
http://www.stat.washington.edu/www/research/reports/1999/tr355.ps - if I 
were more of a statistician I might be able to answer :-)

cheers

andrew



In response to

Responses

pgsql-performance by date

Next:From: Josh BerkusDate: 2005-04-24 18:30:50
Subject: Re: [HACKERS] Bad n_distinct estimation; hacks suggested?
Previous:From: Marko RistolaDate: 2005-04-24 17:09:15
Subject: Re: [HACKERS] Bad n_distinct estimation; hacks suggested?

pgsql-hackers by date

Next:From: Joshua D. DrakeDate: 2005-04-24 18:01:28
Subject: Re: Constant WAL replay
Previous:From: Alvaro HerreraDate: 2005-04-24 17:20:39
Subject: Re: Constant WAL replay

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group