Re: Understanding histograms

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: len(at)pdx(dot)edu
Cc: len(at)cs(dot)pdx(dot)edu, pgsql-performance(at)postgresql(dot)org
Subject: Re: Understanding histograms
Date: 2008-04-30 14:43:11
Message-ID: 18628.1209566591@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

"Len Shapiro" <lenshap(at)gmail(dot)com> writes:
> I asked about n_distinct, whose documentation reads in part "The
> negated form is used when ANALYZE believes that the number of distinct
> values is likely to increase as the table grows". and I asked about
> why ANALYZE believes that the number of distinct values is likely to
> increase. I'm unclear why you quoted to me the documentation on
> stadistinct.

n_distinct is just a view of stadistinct. I assumed you'd poked around
in the code enough to know that ...

>>> The "rows=2" estimate makes sense when const = 1 or 5, but it makes no
>>> sense to me for other values of const not in the MVC list.
>>
>> I'm not sure what estimate you'd expect instead?

> Instead I would expect an estimate of "rows=0" for values of const
> that are not in the MCV list and not in the histogram.

Surely that's not very sane? The MCV list plus histogram generally
don't include every value in the table. IIRC the estimate for values
not present in the MCV list is (1 - sum(MCV frequencies)) divided by
(n_distinct - number of MCV entries), which amounts to assuming that
all values not present in the MCV list occur equally often. The weak
spot of course is that the n_distinct estimate may be pretty inaccurate.

> Where in the source is the code that manipulates the histogram?

commands/analyze.c builds it, and most of the estimation with it
happens in utils/adt/selfuncs.c.

regards, tom lane

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Tom Lane 2008-04-30 14:46:28 Re: Replication Syatem
Previous Message Pavan Deolasee 2008-04-30 13:36:12 Re: Replication Syatem