Skip site navigation (1) Skip section navigation (2)

Re: best statistic target for boolean columns

From: Gaetano Mendola <mendola(at)bigfoot(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: best statistic target for boolean columns
Date: 2004-09-27 22:42:06
Message-ID: 4158973E.7010408@bigfoot.com (view raw or flat)
Thread:
Lists: pgsql-performance
Tom Lane wrote:
> Gregory Stark <gsstark(at)mit(dot)edu> writes:
> 
>>No, actually the stats table keeps the n most common values and their
>>frequency (usually in percentage). So really a target of 2 ought to be enough
>>for boolean values. In fact that's all I see in pg_statistic; I'm assuming
>>there's a full histogram somewhere but I don't see it. Where would it be?
> 
> 
> It's not going to be there.  The histogram only covers values that are
> not in the most-frequent-values list, and therefore it won't exist for a
> column that is completely describable by most-frequent-values.
> 
> 
>>However the target also dictates how large a sample of the table to take. A
>>target of two represents a very small sample. So the estimations could be
>>quite far off.
> 
> 
> Right.  The real point of stats target for such columns is that it
> determines how many rows to sample, and thereby indirectly implies
> the accuracy of the statistics.  For a heavily skewed boolean column
> you'd want a high target so that the number of occurrences of the
> infrequent value would be estimated accurately.
> 
> It's also worth noting that the number of rows sampled is driven by the
> largest per-column stats target in the table, and so reducing stats
> target to 2 for a boolean column will save *zero* effort unless all the
> columns in the table are booleans.

Thank you all, now I have more clear how it works.
Btw last time I was thinking: why during an explain analyze we can not use
the information on about the really extracted rows vs the extimated rows ?

Now I'm reading an article, written by the same author that ispired the magic "300"
on analyze.c, about "Self-tuning Histograms". If this is implemented, I understood
we can take rid of "vacuum analyze" for mantain up to date the statistics.
Have someone in his plans to implement it ?
After all the idea is simple: compare during normal selects the extimated rows and
the actual extracted rows then use this "free" information to refine the histograms.




Regards
Gaetano Mendola




In response to

Responses

pgsql-performance by date

Next:From: Neil ConwayDate: 2004-09-27 23:43:45
Subject: Re: best statistic target for boolean columns
Previous:From: Matt ClarkDate: 2004-09-27 21:41:52
Subject: Re: Caching of Queries

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group