From: Christopher Browne <cbbrowne(at)libertyrms(dot)info>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: [GENERAL] how to get accurate values in pg_statistic (continued)
Date: 2003-09-10 22:22:04
Message-ID: 60znhcxp0z.fsf@dev6.int.libertyrms.info
Lists: pgsql-performance

pgman(at)candle(dot)pha(dot)pa(dot)us (Bruce Momjian) writes:
> Tom Lane wrote:
>> Mary Edie Meredith <maryedie(at)osdl(dot)org> writes:
>> > Stephan Szabo kindly responded to our earlier queries suggesting
>> > we look at default_statistics_target and ALTER TABLE ALTER COLUMN
>> > SET STATISTICS.
>>
>> > These determine the number of bins in the histogram for a given
>> > column. But for a large number of rows (for example 6 million)
>> > the maximum value (1000) does not guarantee that ANALYZE will do
>> > a full scan of the table. We do not see a way to guarantee the
>> > same statistics run to run without forcing ANALYZE to examine
>> > every row of every table.
>>
>> Do you actually still have a problem with the plans changing when
>> the stats target is above 100 or so? I think the notion of "force
>> ANALYZE to do a full scan" is inherently wrongheaded ... it
>> certainly would not produce numbers that have anything to do with
>> ordinary practice.
>>
>> If you have data statistics that are so bizarre that the planner
>> still gets things wrong with a target of 1000, then I'd like to
>> know more about why.
>
> Has there been any progress in determining if the number of default
> buckets (10) is the best value?

I would think this is much more central to the issue for their
benchmark than correctly replicating the random number generator.

I'm not clear on how data is collected into the histogram bins;
obviously it's not selecting all 6 million rows, but how many rows is
it?
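
For what it's worth, here is a minimal sketch of peeking at what
ANALYZE actually stored (pg_stats being the readable view over
pg_statistic); the table and column names, orders and o_custkey, are
made up purely for illustration:

    -- Hypothetical table/column names, purely for illustration.
    -- Shows what ANALYZE stored for one column: the most-common-values
    -- list and the histogram bucket boundaries built from its sample.
    SELECT attname, n_distinct, most_common_vals, histogram_bounds
    FROM pg_stats
    WHERE tablename = 'orders'
      AND attname = 'o_custkey';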

The "right answer" for most use seems likely to involve:

a) Getting an appropriate number of bins (I suspect 10 is a bit
small, but I can't justify that mathematically), and
b) Attaching an appropriate sample size to those bins (see the sketch
below).
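
For concreteness, a sketch of the knobs Stephan pointed at; the
lineitem table and l_partkey column are placeholder names, and 100 is
just an arbitrary trial value:

    -- Placeholder table/column names, for illustration only.
    -- Per column: raise the target (number of bins / MCV slots) above
    -- the default of 10, then re-ANALYZE so a new sample is taken.
    ALTER TABLE lineitem ALTER COLUMN l_partkey SET STATISTICS 100;
    ANALYZE lineitem;

    -- Or raise the default for columns without an explicit setting;
    -- this only affects ANALYZE runs made after the change.
    SET default_statistics_target = 100;
    ANALYZE;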

What is apparently going wrong with the benchmark (and this can
doubtless arise in "real life," too) is that the random selection is
pulling too few records, with the result that some of the bins are
filled in a "skewed" manner that leads the optimizer to draw the wrong
conclusions. (I may merely be restating the obvious here, but if I
say it a little differently than it has been said before, someone may
notice the vital "wrong assumption.")

If the samples are crummy, then perhaps:
- There need to be more bins
- There need to be more samples

Does the sample size change if you increase the number of bins? If
not, then having more, smaller bins will lead to them getting
increasingly skewed if there is any accidental skew in the selection.
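
My (quite possibly wrong) reading of src/backend/commands/analyze.c is
that the two are tied together: the sample is roughly 300 rows per
target bucket, taken at the largest target among the table's columns,
so raising the target raises the sample size as well. Treating that
300-per-bucket factor as an assumption rather than gospel, the
back-of-the-envelope numbers look like:

    -- Assumes the ~300-rows-per-target-bucket factor described above;
    -- if that is right, the sample grows with the number of bins
    -- rather than staying fixed.
    SELECT 300 * 10   AS rows_sampled_at_default_target,  -- ~3,000
           300 * 1000 AS rows_sampled_at_max_target;      -- ~300,000

If that reading is right, even the maximum target samples only a few
hundred thousand of their 6 million rows, which may be where the
run-to-run variation is coming from.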

Do we also need a parameter to control sample size?
--
output = reverse("ofni.smrytrebil" "@" "enworbbc")
<http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 646 3304 x124 (land)
