Re: Understanding histograms

From: "Len Shapiro" <lenshap(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: len(at)cs(dot)pdx(dot)edu, pgsql-performance(at)postgresql(dot)org
Subject: Re: Understanding histograms
Date: 2008-04-30 06:32:18
Message-ID: c5ee9b8a0804292332q32b468e3ga6b99e25b56c18c7@mail.gmail.com
Lists: pgsql-performance

Tom,

Thank you for your prompt reply.

On Tue, Apr 29, 2008 at 10:19 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Len Shapiro <len(at)cs(dot)pdx(dot)edu> writes:
> > 1. Why does Postgres come up with a negative n_distinct?
>
> It's a fractional representation. Per the docs:
>
> > stadistinct float4 The number of distinct nonnull data values in the column. A value greater than zero is the actual number of distinct values. A value less than zero is the negative of a fraction of the number of rows in the table (for example, a column in which values appear about twice on the average could be represented by stadistinct = -0.5). A zero value means the number of distinct values is unknown

I asked about n_distinct, whose documentation reads in part: "The
negated form is used when ANALYZE believes that the number of distinct
values is likely to increase as the table grows". My question was why
ANALYZE believes that the number of distinct values is likely to
increase. I'm unclear why you quoted the documentation on stadistinct
to me.
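
For reference, the sign convention in the documentation quoted above can
be sketched in Python. This is only an illustration of the documented
meaning of the value; the function name and parameters are hypothetical
and are not taken from the PostgreSQL source.

```python
from typing import Optional


def distinct_count(stadistinct: float, row_count: float) -> Optional[float]:
    """Turn a stadistinct value into an absolute number of distinct values.

    Hypothetical helper that follows the quoted documentation:
    positive -> actual number of distinct values,
    negative -> negated fraction of the number of rows in the table,
    zero     -> number of distinct values is unknown.
    """
    if stadistinct > 0:
        return stadistinct
    if stadistinct < 0:
        # A fractional value scales with the table as it grows.
        return -stadistinct * row_count
    return None


# The documentation's own example: values appear about twice on average.
print(distinct_count(-0.5, 1000))
```

With stadistinct = -0.5 the count is half the number of rows, whatever
size the table has at the time the value is read.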
>
>
> > The "rows=2" estimate makes sense when const = 1 or 5, but it makes no
> > sense to me for other values of const not in the MCV list.
> > For example, if I run the query
> > EXPLAIN SELECT * from sailors where rank = -1000;
> > Postgres still gives an estimate of "rows=2".
>
> I'm not sure what estimate you'd expect instead?

I would instead expect an estimate of "rows=0" for values of const
that are neither in the MCV list nor in the histogram. When the
histogram has fewer than the maximum number of entries, implying (I am
guessing here) that all non-MCV values are in the histogram list, this
seems like a simple strategy and has the virtue of being accurate.
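
A minimal Python sketch of the strategy proposed above. It rests on the
same guess stated in the paragraph (a histogram shorter than the maximum
lists every non-MCV value), and every name and all sample data in it are
hypothetical. It describes the proposal, not what Postgres does.

```python
from typing import Optional, Sequence


def proposed_rows_estimate(
    const: int,
    mcv_values: Sequence[int],
    mcv_freqs: Sequence[float],
    histogram_bounds: Sequence[int],
    row_count: float,
    max_entries: int,
) -> Optional[float]:
    """Row estimate for "column = const" under the proposed strategy."""
    if const in mcv_values:
        # Known common value: use its recorded frequency.
        return mcv_freqs[list(mcv_values).index(const)] * row_count
    # Assumption (a guess): a histogram with fewer than the maximum
    # number of entries contains every non-MCV value.
    histogram_is_complete = len(histogram_bounds) < max_entries
    if histogram_is_complete and const not in histogram_bounds:
        return 0.0
    # Otherwise this sketch makes no claim; use the planner's usual estimate.
    return None


# Hypothetical statistics for a small table of 20 rows.
print(proposed_rows_estimate(-1000, [3, 7], [0.5, 0.25], [1, 5, 9], 20, 10))
```

Returning None for the undecided case keeps the sketch limited to the one
situation the proposal covers.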

Where in the source is the code that manipulates the histogram?

> The code has a built in
> assumption that no value not present in the MCV list can be more
> frequent than the last member of the MCV list, so it's definitely not
> gonna guess *more* than 2.

That's interesting. Where is this in the source code?
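
The assumption described in the quoted text can be sketched in Python as
follows. Only the cap at the last MCV frequency comes from the
description above; spreading the remaining frequency evenly over the
other distinct values is an assumption made for this illustration, and
the function and parameter names are hypothetical.

```python
from typing import Sequence


def non_mcv_selectivity(
    mcv_freqs: Sequence[float],
    null_frac: float,
    n_distinct: float,
) -> float:
    """Selectivity guess for a constant that is not in the MCV list.

    mcv_freqs is assumed to be sorted by descending frequency, so the
    last entry is the least common of the most common values.
    """
    # Frequency left over after the MCVs and the NULLs are accounted for.
    remaining = 1.0 - sum(mcv_freqs) - null_frac
    other_distinct = n_distinct - len(mcv_freqs)
    # Assumption: the remainder is shared evenly by the other values.
    selec = remaining / other_distinct if other_distinct > 1 else remaining
    if mcv_freqs:
        # The cap from the quoted text: never guess more than the
        # frequency of the last member of the MCV list.
        selec = min(selec, mcv_freqs[-1])
    return max(selec, 0.0)


# Hypothetical statistics in which the even split would exceed the cap.
print(non_mcv_selectivity([0.5, 0.1], 0.0, 4))
```

Multiplying the result by the number of rows gives a row estimate, which
under this cap can never exceed the estimate for the last MCV entry.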

Thanks for all your help.

All the best,

Len Shapiro

> regards, tom lane
>
