Re: On Distributions In 7.2.1

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Mark kirkwood <markir(at)slingshot(dot)co(dot)nz>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: On Distributions In 7.2.1
Date: 2002-05-02 14:11:50
Message-ID: 7233.1020348710@sss.pgh.pa.us
Lists: pgsql-general

Mark kirkwood <markir(at)slingshot(dot)co(dot)nz> writes:
> However Tom's observation is still valid (in spite of my math) - all the
> frequencies are overestimated, rather than the expected "some bigger,
> some smaller" sort of thing.

No, that makes sense. The values that get into the most-common-values
list are only going to be ones that are significantly more common (in
the sample) than the estimated average frequency. So if the thing makes
a good estimate of the average frequency, you'll only see upside
outliers in the MCV list. The relevant logic is in analyze.c:

/*
 * Decide how many values are worth storing as most-common values.
 * If we are able to generate a complete MCV list (all the values
 * in the sample will fit, and we think these are all the ones in
 * the table), then do so. Otherwise, store only those values
 * that are significantly more common than the (estimated)
 * average. We set the threshold rather arbitrarily at 25% more
 * than average, with at least 2 instances in the sample. Also,
 * we won't suppress values that have a frequency of at least 1/K
 * where K is the intended number of histogram bins; such values
 * might otherwise cause us to emit duplicate histogram bin
 * boundaries.
 */

regards, tom lane
