Re: [HACKERS] Bad n_distinct estimation; hacks suggested?

From: "Dave Held" <dave(dot)held(at)arraysg(dot)com>
To: "pgsql-perform" <pgsql-performance(at)postgresql(dot)org>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Bad n_distinct estimation; hacks suggested?
Date: 2005-04-27 15:47:36
Message-ID: 49E94D0CFCD4DB43AFBA928DDD20C8F9026184E1@asg002.asg.local
Lists: pgsql-hackers pgsql-performance

> -----Original Message-----
> From: Josh Berkus [mailto:josh(at)agliodbs(dot)com]
> Sent: Wednesday, April 27, 2005 10:25 AM
> To: Andrew Dunstan
> Cc: Mischa Sandberg; pgsql-perform; pgsql-hackers(at)postgresql(dot)org
> Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks
> suggested?
>
> [...]
> Actually, it's more to characterize how large of a sample
> we need. For example, if we sample 0.005 of disk pages, and
> get an estimate, and then sample another 0.005 of disk pages
> and get an estimate which is not even close to the first
> estimate, then we have an idea that this is a table which
> defies analysis based on small samples.

I buy that.
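
To make that cross-check concrete, here's a rough Python sketch of
the idea (the 0.005 fraction, the 50% tolerance, and the naive
scale-up are placeholders for illustration, not what ANALYZE should
literally do):

import random

def block_sample_estimate(pages, fraction):
    """Estimate n_distinct from a random sample of disk pages.
    'pages' is a list of pages, each page a list of column values."""
    k = max(1, int(len(pages) * fraction))
    rows = [v for page in random.sample(pages, k) for v in page]
    # Deliberately naive scale-up; a real estimator would use the
    # duplicate structure of the sample, not just its distinct count.
    return len(set(rows)) * len(pages) / k

def defies_small_samples(pages, fraction=0.005, tolerance=0.5):
    """Take two independent small samples; if the estimates disagree
    badly, flag the table as one that defies small-sample analysis."""
    e1 = block_sample_estimate(pages, fraction)
    e2 = block_sample_estimate(pages, fraction)
    return abs(e1 - e2) / max(e1, e2) > tolerance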

> Whereas if the two estimates are < 1.0 stdev apart, we can
> have good confidence that the table is easily estimated.

I don't buy that. A negative indication is nothing more than
proof by contradiction. A positive indication is mathematical
induction over the set, which in this type of context is
logically unsound. There is no reason to believe that two
small samples with a small difference imply that a table is
easily estimated rather than that you got unlucky in your
samples.
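
Here's a toy counter-example using the same naive estimator as
above: every value appears exactly ten times, scattered uniformly
across the pages. Two small block samples agree with each other
almost exactly, and both are badly wrong:

import random

random.seed(42)
# 1,000,000 rows on 20,000 pages of 50 rows; 100,000 distinct values,
# each appearing 10 times, shuffled uniformly across the pages.
values = [v for v in range(100_000) for _ in range(10)]
random.shuffle(values)
pages = [values[i:i + 50] for i in range(0, len(values), 50)]

def naive_estimate(fraction=0.005):
    k = int(len(pages) * fraction)
    rows = [v for p in random.sample(pages, k) for v in p]
    return len(set(rows)) * len(pages) / k

print(naive_estimate(), naive_estimate())
# The two runs agree to within a percent or two of each other, yet
# both come out near 1,000,000 when the true n_distinct is 100,000.
# Low variance between samples says nothing about bias.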

> [...]
> Yes, actually. We need 3 different estimation methods:
> 1 for tables where we can sample a large % of pages
> (say, >= 0.1)
> 1 for tables where we sample a small % of pages but are
> "easily estimated"
> 1 for tables which are not easily estimated but we can't
> afford to sample a large % of pages.

I don't buy that the first and second need to be different
estimation methods. I think you can use the same block
sample estimator for both, and simply stop sampling at
different points. If you set the default to be a fixed
number of blocks, you could get a large % of pages on
small tables and a small % of pages on large tables, which
is exactly how you define the first two cases. However,
I think such a default should also be overridable to a
% of the table or a desired accuracy.
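
Something like this is all I mean by the default (the 3,000-block
figure and the parameter names are made up for illustration, not a
proposal for actual settings):

def blocks_to_sample(total_blocks,
                     default_blocks=3000,   # made-up default, not a proposal
                     pct_override=None,     # e.g. 0.02 to force 2% of the table
                     accuracy_override=None):
    """Fixed default block count, overridable to a % of the table or
    to an accuracy target.  A fixed count is automatically a large
    fraction of a small table and a small fraction of a large one."""
    if pct_override is not None:
        return max(1, int(total_blocks * pct_override))
    if accuracy_override is not None:
        # Placeholder: keep sampling until successive estimates
        # converge to within the requested accuracy, then stop.
        raise NotImplementedError
    return min(total_blocks, default_blocks)

print(blocks_to_sample(1_000))        # small table: all 1,000 blocks (100%)
print(blocks_to_sample(10_000_000))   # large table: 3,000 blocks (0.03%)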

Of course, I would recommend the distinct sample technique
for the third case.
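
(To spell out what I mean by that: a one-pass, hash-based scheme
along the lines of the distinct-sampling literature, e.g. Gibbons'
paper. The sketch below is my own stripped-down illustration, not
the algorithm from the paper; it needs a pass over the data, or
incremental maintenance, but only bounded memory, and duplicates
cannot inflate it the way they distort a plain row sample.)

import hashlib

def distinct_sample_estimate(values, max_sample=1024):
    """Keep only values whose hash survives a progressively tighter
    filter, then scale the surviving sample back up by the filter's
    selectivity.  Repeated values hash identically, so duplicates do
    not bias the estimate."""
    def h(v):
        return int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")

    level = 0
    sample = set()
    for v in values:
        if h(v) % (1 << level) == 0:         # value survives current level
            sample.add(v)
            while len(sample) > max_sample:  # sample full: tighten the filter
                level += 1
                sample = {s for s in sample if h(s) % (1 << level) == 0}
    return len(sample) * (1 << level)

# Example: 100,000 distinct values, each appearing 10 times.
data = [v for v in range(100_000) for _ in range(10)]
print(distinct_sample_estimate(data))   # typically within ~10% of 100,000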

> If we're doing sampling-based estimation, I really don't
> want people to lose sight of the fact that page-based random
> sampling is much less expensive than row-based random
> sampling. We should really be focusing on methods which
> are page-based.

Of course, that savings comes at the expense of having to
account for factors like clustering within blocks. So block
sampling is more efficient, but can also be less accurate.
Nonetheless, I agree that of the sampling estimators, block
sampling is the better technique.
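
Rough arithmetic on the cost side, with made-up table geometry, just
to quantify "much less expensive":

# Sampling 30,000 rows from a table of 1,000,000 pages, 100 rows/page.
rows_wanted, rows_per_page, total_pages = 30_000, 100, 1_000_000

# Row-based: random rows almost always land on pages not yet read, so
# the expected number of distinct pages touched is close to the row count.
row_based_reads = total_pages * (1 - (1 - 1 / total_pages) ** rows_wanted)

# Block-based: read whole pages and keep every row on them.
block_based_reads = rows_wanted // rows_per_page

print(round(row_based_reads), block_based_reads)
# Nearly 30,000 page reads vs just 300.  But the block-sampled rows
# are not independent draws: rows that share a page tend to share
# values, and the estimator has to correct for that clustering or
# lose accuracy.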

__
David B. Held
Software Engineer/Array Services Group
200 14th Ave. East, Sartell, MN 56377
320.534.3637 320.253.7800 800.752.8129
