Quick Links

Re: Bad n_distinct estimation; hacks suggested?

From:	Greg Stark <gsstark(at)mit(dot)edu>
To:	josh(at)agliodbs(dot)com
Cc:	Marko Ristola <marko(dot)ristola(at)kolumbus(dot)fi>, pgsql-perform <pgsql-performance(at)postgresql(dot)org>
Subject:	Re: Bad n_distinct estimation; hacks suggested?
Date:	2005-04-23 08:02:05
Message-ID:	8764ydapky.fsf@stark.xeocode.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers pgsql-performance

Josh Berkus <josh(at)agliodbs(dot)com> writes:

> ... I think the problem is in our heuristic sampling code. I'm not the first
> person to have this kind of a problem. Will be following up with tests ...

I looked into this a while back when we were talking about changing the
sampling method. The conclusions were discouraging. Fundamentally, using
constant sized samples of data for n_distinct is bogus. Constant sized samples
only work for things like the histograms that can be analyzed through standard
statistics population sampling which depends on the law of large numbers.

n_distinct requires you to estimate how frequently very rare things occur. You
can't apply the law of large numbers because even a single instance of a value
out of a large pool alters the results disproportionately.

To get a valid estimate for n_distinct you would need to sample a fixed
percentage of the table. Not a fixed size sample. That just isn't practical.
Moreover, I think the percentage would have to be quite large. Even if you
sampled half the table I think the confidence interval would be quite wide.

The situation is worsened because it's unclear how to interpolate values for
subsets of the table. If the histogram says you have a million records and
you're adding a clause that has a selectivity of 50% then half a million is a
good guess. But if what you care about is n_distinct and you start with a
million records with 1,000 distinct values and then apply a clause that
filters 50% of them, how do you estimate the resulting n_distinct? 500 is
clearly wrong, but 1,000 is wrong too. You could end up with anywhere from 0
to 1,000 and you have no good way to figure out where the truth lies.

So I fear this is fundamentally a hopeless situation. It's going to be
difficult to consistently get good plans for any queries that depend on good
estimates for n_distinct.

--
greg

In response to

Re: Bad n_distinct estimation; hacks suggested? at 2005-04-22 20:36:08 from Josh Berkus

Responses

Re: Bad n_distinct estimation; hacks suggested? at 2005-04-23 23:39:11 from Josh Berkus

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Hannu Krosing	2005-04-23 09:17:26	Re: possible TODO: read-only tables, select from indexes
Previous Message	Kevin Brown	2005-04-23 06:28:33	Re: Postgres: pg_hba.conf, md5, pg_shadow, encrypted passwords

Browse pgsql-performance by date

	From	Date	Subject
Next Message	Joel Fradkin	2005-04-23 14:11:28	Re: Joel's Performance Issues WAS : Opteron vs Xeon
Previous Message	Kevin Brown	2005-04-23 06:30:45	Re: Joel's Performance Issues WAS : Opteron vs Xeon