Re: [HACKERS] Bad n_distinct estimation; hacks suggested?

From: "Andrew Dunstan" <andrew(at)dunslane(dot)net>
To: <josh(at)agliodbs(dot)com>
Cc: <gsstark(at)mit(dot)edu>, <marko(dot)ristola(at)kolumbus(dot)fi>, <pgsql-performance(at)postgresql(dot)org>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Bad n_distinct estimation; hacks suggested?
Date: 2005-04-23 23:44:28
Message-ID: 1963.24.211.165.134.1114299868.squirrel@www.dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-performance

Josh Berkus said:
>
>
> Well, unusual distributions are certainly tough. But I think the
> problem exists even for relatively well-distributed populations.
> Part of it is, I believe, the formula we are using:
>
> n*d / (n - f1 + f1*n/N)
>
[snip]
>
> This is so broken, in fact, that I'm wondering if we've read the paper
> right? I've perused the paper on almaden, and the DUJ1 formula
> appears considerably more complex than the formula we're using.
>
> Can someone whose math is more recent than calculus in 1989 take a look
> at that paper, and look at the formula toward the bottom of page 10,
> and see if we are correctly interpreting it? I'm particularly
> confused as to what "q" and "d-sub-n" represent. Thanks!
>

Math not too recent ...

I quickly read the paper and independently came up with the same formula you
say above we are applying. The formula is on the page that is numbered 6,
although it's the tenth page in the PDF.

q = n/N = ratio of sample size to population size
d_sub_n = d = number of distinct classes in sample

cheers

andrew

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Josh Berkus 2005-04-23 23:53:16 Re: Bad n_distinct estimation; hacks suggested?
Previous Message Josh Berkus 2005-04-23 23:39:11 Re: Bad n_distinct estimation; hacks suggested?

Browse pgsql-performance by date

  From Date Subject
Next Message Josh Berkus 2005-04-23 23:53:16 Re: Bad n_distinct estimation; hacks suggested?
Previous Message Josh Berkus 2005-04-23 23:39:11 Re: Bad n_distinct estimation; hacks suggested?