Re: [HACKERS] Bad n_distinct estimation; hacks suggested?

From: "Dave Held"
To: pgsql-perform
Date: 2005-04-27 15:47:36
Subject: Re: [HACKERS] Bad n_distinct estimation; hacks suggested?
Message-ID: 49E94D0CFCD4DB43AFBA928DDD20C8F9026184E1@asg002.asg.local
Lists: pgsql-hackers, pgsql-performance
Responses: Greg Stark (2005-04-27 17:16:48), Marko Ristola (2005-04-28 17:44:37)
```
> -----Original Message-----
> From: Josh Berkus [mailto:josh(at)agliodbs(dot)com]
> Sent: Wednesday, April 27, 2005 10:25 AM
> To: Andrew Dunstan
> Cc: Mischa Sandberg; pgsql-perform; pgsql-hackers(at)postgresql(dot)org
> Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks
> suggested?
>
> [...]
> Actually, it's more to characterize how large of a sample
> we need.  For example, if we sample 0.005 of disk pages, and
> get an estimate, and then sample another 0.005 of disk pages
> and get an estimate which is not even close to the first
> estimate, then we have an idea that this is a table which
> defies analysis based on small samples.

> Whereas if the two estimates are < 1.0 stdev apart, we can
> have good confidence that the table is easily estimated.

I don't buy that.  A negative indication is nothing more than
proof by contradiction, but a positive indication amounts to
induction over the whole set, which is logically unsound in
this context.  There is no reason to believe that two small
samples with a small difference imply that a table is easily
estimated, rather than that you simply got unlucky in your
samples.

> [...]
> Yes, actually.   We need 3 different estimation methods:
> 1 for tables where we can sample a large % of pages
> (say, >= 0.1)
> 1 for tables where we sample a small % of pages but are
> "easily estimated"
> 1 for tables which are not easily estimated but we can't
> afford to sample a large % of pages.

I don't buy that the first and second need to be different
estimation methods.  I think you can use the same block
sample estimator for both, and simply stop sampling at
different points.  If you set the default to be a fixed
number of blocks, you could get a large % of pages on
small tables and a small % of pages on large tables, which
is exactly how you define the first two cases.  However,
I think such a default should also be overridable to a
% of the table or a desired accuracy.

Of course, I would recommend the distinct sample technique
for the third case.

> If we're doing sampling-based estimation, I really don't
> want people to lose sight of the fact that page-based random
> sampling is much less expensive than row-based random
> sampling.   We should really be focusing on methods which
> are page-based.

Of course, that savings comes at the expense of having to
account for factors like clustering within blocks.  So block
sampling is more efficient, but can also be less accurate.
Nonetheless, I agree that of the sampling estimators, block
sampling is the better technique.

--
David B. Held
Software Engineer/Array Services Group
200 14th Ave. East,  Sartell, MN 56377
320.534.3637 320.253.7800 800.752.8129

```
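Held's objection above — that two small samples agreeing with each other does not mean either one is close to the truth — is easy to illustrate with a short simulation. This is an illustrative sketch, not code from the thread; the table layout (every value occurring exactly twice) and the naive estimator are assumptions chosen to make the failure mode obvious:

```python
import random

random.seed(42)
N_ROWS, TRUE_DISTINCT = 100_000, 50_000
table = [i // 2 for i in range(N_ROWS)]       # every value occurs exactly twice

def naive_estimate(sample):
    """Naive n_distinct estimate: distinct values seen in the sample."""
    return len(set(sample))

sample_size = int(N_ROWS * 0.005)             # the 0.005 fraction discussed above
est1 = naive_estimate(random.sample(table, sample_size))
est2 = naive_estimate(random.sample(table, sample_size))

# The two estimates agree closely with each other ...
print(est1, est2)
# ... yet both are roughly two orders of magnitude below the truth.
print(TRUE_DISTINCT)
```

Both 0.5% samples report close to 500 distinct values (almost every sampled row is unique within the sample), so the "two estimates agree" test passes even though the true count is 50,000.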
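The "same estimator, different stopping points" idea from the reply can be sketched as follows. The budget of 300 blocks and the helper name are hypothetical, not a proposal from this thread; the point is only that one fixed default naturally yields a large fraction of a small table and a small fraction of a large one:

```python
DEFAULT_SAMPLE_BLOCKS = 300   # hypothetical fixed default budget

def blocks_to_sample(table_blocks, pct_override=None,
                     max_blocks=DEFAULT_SAMPLE_BLOCKS):
    """How many blocks to read: a fixed default, unless the caller
    overrides it with a fraction of the table."""
    if pct_override is not None:
        return max(1, int(table_blocks * pct_override))
    return min(table_blocks, max_blocks)

# Small table, 500 blocks: the default reads 60% of it (the "large %" case).
print(blocks_to_sample(500) / 500)            # 0.6
# Large table, 1,000,000 blocks: the same default reads 0.03% (the "small %" case).
print(blocks_to_sample(1_000_000) / 1_000_000)
# Override: sample 10% of the table regardless of its size.
print(blocks_to_sample(1_000_000, pct_override=0.1))
```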
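For the third case, the "distinct sample technique" refers to hash-based distinct sampling in the spirit of Gibbons (VLDB 2001): a value is kept only while its hash clears a threshold that rises as the sample fills, and the sample size is scaled back up by the threshold at the end. A minimal sketch — the capacity and the MD5-based hash are illustrative assumptions, not the actual proposal:

```python
import hashlib

def _h(value):
    """Stable 32-bit hash of a value."""
    return int.from_bytes(hashlib.md5(str(value).encode()).digest()[:4], "big")

def distinct_sample_estimate(values, capacity=1024):
    """Keep a value only if its hash is divisible by 2**level; when the
    sample overflows, raise the level and re-filter the survivors.  The
    estimate is the sample size scaled back up by 2**level."""
    level, sample = 0, set()
    for v in values:
        if _h(v) % (1 << level) == 0:
            sample.add(v)
            while len(sample) > capacity:
                level += 1
                sample = {s for s in sample if _h(s) % (1 << level) == 0}
    return len(sample) * (1 << level)

# 100,000 rows holding 5,000 distinct values, summarized in ~1,000 slots.
est = distinct_sample_estimate(i % 5000 for i in range(100_000))
print(est)   # close to the true 5,000
```

Unlike a plain row or block sample, the accuracy here depends on the number of distinct values retained, not on the fraction of the table scanned — which is why it suits tables that defy small-sample estimation.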
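The clustering caveat in the last paragraph is also easy to demonstrate: when equal values are stored contiguously (e.g. after CLUSTER), a block sample of N rows sees far fewer distinct values than a row sample of the same N rows. A toy model with assumed sizes — 100,000 rows, 1,000 distinct values, 100-row blocks, perfectly clustered:

```python
import random

random.seed(0)
ROWS_PER_BLOCK = 100
# 100,000 rows, 1,000 distinct values, each stored as one contiguous run.
table = [i // ROWS_PER_BLOCK for i in range(100_000)]
blocks = [table[i:i + ROWS_PER_BLOCK]
          for i in range(0, len(table), ROWS_PER_BLOCK)]

block_rows = [v for b in random.sample(blocks, 10) for v in b]   # 10 blocks = 1,000 rows
row_rows = random.sample(table, 1_000)                           # 1,000 random rows

print(len(set(block_rows)))   # exactly 10: each block holds a single value
print(len(set(row_rows)))     # several hundred
```

Real tables are rarely this pathological, but the direction of the bias is the same: block sampling reads the same number of rows far more cheaply, at the cost of correlated rows within each block.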
