Skip site navigation (1) Skip section navigation (2)

Re: [HACKERS] Bad n_distinct estimation; hacks suggested?

From: Marko Ristola <marko(dot)ristola(at)kolumbus(dot)fi>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Dave Held <dave(dot)held(at)arraysg(dot)com>,pgsql-perform <pgsql-performance(at)postgresql(dot)org>,pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] Bad n_distinct estimation; hacks suggested?
Date: 2005-04-28 17:44:37
Message-ID: 42712105.6030709@kolumbus.fi (view raw or flat)
Thread:
Lists: pgsql-hackerspgsql-performance
First I will comment my original idea.
Second I will give another improved suggestion (an idea).
I hope, that they will be useful for you.

(I don't know, wether the first one was useful at all because it showed,
that I and some others of us are not very good with statistics :( )

I haven't looked about the PostgreSQL code, so I don't know, that what 
is possible
now, and what is not. I do know, that the full table scan and after that 
incremental
statistics changes are a very big change, without looking at the code.



I meant the following  idea:
- compare two equal sized samples. Then redo the same thing with double
sized samples. So do lots of unnecessary work.
Check out the correlation of the two samples to try to guess the 
distribution.

So I tried to give you an idea, not to give you a full answer into the 
whole problem.


I did read some parts of the attached PDFs. They did convince me,
that it seems, that the heuristics for the hard cases would actually read
almost the whole table in many cases.

I did cover the "too little sample" problem by stating that the
user should be able to give the minimum size of samples. This way you would
avoid the too small sampling problem. My purpose was not to achieve at
most 5% wrong estimates, but to decrease the 2000% wrong estimates, that 
are
seen now sometimes.

Conclusions:
- No heuristics or similar thing of small samples will grant excellent 
results.
- If you need excellent estimates, you need to process the whole table!
- Some special cases, like primary keys and the unique indexes and special
case column types do give easy ways to make estimates:
For example, wether a boolean column has zero, one or two distinct 
values, it does not matter
so much ??? Hashing seems the right choise for all of them.

If I have understund correctly, the full table scans are out of
questions for large tables at this time.

The percentage idea of taking 10% samples seems good.


So here is another suggestion:
1. Do a full percentage scan, starting at an arbitrary position. If the 
user's data is not
homogenous, this hurts it, but this way it is faster.
During that scan, try to figure out all those columns, that have at most 
100 distinct values.

Of course, with it you can't go into 100% accuracy, but if the full 
table scan is out of question now,
it is better, if the accuracy is for example at most ten times wrong.

You could also improve accuracy by instead of doing a 10% partial table 
scan, you could
do 20 pieces of 0,5 percent partial table scans: This would improve 
accuracy a bit, but keep
the speed almost the same as the partial table scan.

Here are questions for the statisticians for distinct values calculation:

If we want at most 1000% tolerance, how big percentage of table's one
column must be processed?

If we want at most 500% tolerance, how big percentage of table's one
column must be processed?

If we want at most 250% tolerance, how big percentage of table's one
column must be processed?

Better to assume, that there are at most 100 distinct values on a table,
if it helps calculations.

If we try to get as much with one discontinuous partial table scan
(0,1-10% sample), here is the information, we can gather:

1. We could gather a histogram for max(100) distinct values for each 
column for every column.
2. We could measure variance and average, and the number of rows for 
these 100 distinct values.
3. We could count the number of rows, that didn't match with these 100 
distinct values:
they were left out from the histogram.
4. We could get a minimum and a maximum value for each column.

=> We could get exact information about the sample with one 0,1-10% pass 
for many columns.

What you statisticans can gather about these values?
My idea is programmatical combined with statistics:
+ Performance: scan for example 100 blocks each of size 100Mb, because 
disc I/O
is much faster this way.
+ Enables larger table percentage. I hope it helps with the statistics 
formula.
    Required because of more robust statistics: take those blocks at random
    (not over each other) places to decrease the effect from hitting 
into statistically
    bad parts on the table.
+ Less table scan passes: scan all columns with limited hashing in the 
first pass.
+ All easy columns are found here with one pass.
+- Harder columns need an own pass each, but we have some preliminary
    knoledge of them on the given sample after all (minimum and maximum 
values
    and the histogram of the 100 distinct values).

Marko Ristola

Greg Stark wrote:

>"Dave Held" <dave(dot)held(at)arraysg(dot)com> writes:
>
>  
>
>>>Actually, it's more to characterize how large of a sample
>>>we need.  For example, if we sample 0.005 of disk pages, and
>>>get an estimate, and then sample another 0.005 of disk pages
>>>and get an estimate which is not even close to the first
>>>estimate, then we have an idea that this is a table which 
>>>defies analysis based on small samples.  
>>>      
>>>
>>I buy that.
>>    
>>
>
>Better yet is to use the entire sample you've gathered of .01 and then perform
>analysis on that sample to see what the confidence interval is. Which is
>effectively the same as what you're proposing except looking at every possible
>partition.
>
>Unfortunately the reality according to the papers that were sent earlier is
>that you will always find the results disappointing. Until your sample is
>nearly the entire table your estimates for n_distinct will be extremely
>unreliable.
>
>  
>


In response to

pgsql-performance by date

Next:From: Enrico WeigeltDate: 2005-04-29 02:35:13
Subject: index on different types
Previous:From: Mischa SandbergDate: 2005-04-28 15:21:36
Subject: Re: [HACKERS] Bad n_distinct estimation; hacks suggested?

pgsql-hackers by date

Next:From: Michael FuhrDate: 2005-04-28 18:48:09
Subject: Re: Returning a RECORD, not SETOF RECORD
Previous:From: Markus SchaberDate: 2005-04-28 17:35:34
Subject: Re: Statement Timeout and Locking

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group