Skip site navigation (1) Skip section navigation (2)

Re: [HACKERS] Bad n_distinct estimation; hacks suggested?

From: "Dave Held" <dave(dot)held(at)arraysg(dot)com>
To: "pgsql-perform" <pgsql-performance(at)postgresql(dot)org>,<pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Bad n_distinct estimation; hacks suggested?
Date: 2005-04-27 15:47:36
Message-ID: 49E94D0CFCD4DB43AFBA928DDD20C8F9026184E1@asg002.asg.local (view raw or flat)
Thread:
Lists: pgsql-hackerspgsql-performance
> -----Original Message-----
> From: Josh Berkus [mailto:josh(at)agliodbs(dot)com]
> Sent: Wednesday, April 27, 2005 10:25 AM
> To: Andrew Dunstan
> Cc: Mischa Sandberg; pgsql-perform; pgsql-hackers(at)postgresql(dot)org
> Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks
> suggested?
> 
> [...]
> Actually, it's more to characterize how large of a sample
> we need.  For example, if we sample 0.005 of disk pages, and
> get an estimate, and then sample another 0.005 of disk pages
> and get an estimate which is not even close to the first
> estimate, then we have an idea that this is a table which 
> defies analysis based on small samples.  

I buy that.

> Wheras if the two estimates are < 1.0 stdev apart, we can
> have good confidence that the table is easily estimated. 

I don't buy that.  A negative indication is nothing more than
proof by contradiction.  A positive indication is mathematical
induction over the set, which in this type of context is 
logically unsound.  There is no reason to believe that two
small samples with a small difference imply that a table is
easily estimated rather than that you got unlucky in your
samples.

> [...]
> Yes, actually.   We need 3 different estimation methods:
> 1 for tables where we can sample a large % of pages
> (say, >= 0.1)
> 1 for tables where we sample a small % of pages but are 
> "easily estimated"
> 1 for tables which are not easily estimated by we can't 
> afford to sample a large % of pages.

I don't buy that the first and second need to be different
estimation methods.  I think you can use the same block
sample estimator for both, and simply stop sampling at
different points.  If you set the default to be a fixed
number of blocks, you could get a large % of pages on
small tables and a small % of pages on large tables, which
is exactly how you define the first two cases.  However,
I think such a default should also be overridable to a
% of the table or a desired accuracy.

Of course, I would recommend the distinct sample technique
for the third case.

> If we're doing sampling-based estimation, I really don't
> want people to lose sight of the fact that page-based random
> sampling is much less expensive than row-based random
> sampling.   We should really be focusing on methods which 
> are page-based.

Of course, that savings comes at the expense of having to
account for factors like clustering within blocks.  So block
sampling is more efficient, but can also be less accurate.
Nonetheless, I agree that of the sampling estimators, block
sampling is the better technique.

__
David B. Held
Software Engineer/Array Services Group
200 14th Ave. East,  Sartell, MN 56377
320.534.3637 320.253.7800 800.752.8129

Responses

pgsql-performance by date

Next:From: Richard RowellDate: 2005-04-27 16:07:14
Subject: Suggestions for a data-warehouse migration routine
Previous:From: Josh BerkusDate: 2005-04-27 15:25:16
Subject: Re: [HACKERS] Bad n_distinct estimation; hacks suggested?

pgsql-hackers by date

Next:From: Bruce MomjianDate: 2005-04-27 16:39:44
Subject: Re: [HACKERS] Continue transactions after errors in psql
Previous:From: Josh BerkusDate: 2005-04-27 15:25:16
Subject: Re: [HACKERS] Bad n_distinct estimation; hacks suggested?

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group