Re: Statistics and selectivity estimation for ranges

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Statistics and selectivity estimation for ranges
Date: 2012-10-09 20:08:03
Message-ID: CAPpHfdvPR8whLzjX70zx42Ur2_kjC2uTQW=vkPYwiTEtS=SGgA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Oct 1, 2012 at 3:22 AM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:

> On Tue, 2012-09-04 at 17:27 +0400, Alexander Korotkov wrote:
> > Addon patch is attached. Actually, I don't get your intention of
> > introducing STATISTIC_KIND_RANGE_EMPTY_FRAC stakind. Did you plan to
> > leave it as empty frac in distinct stakind or replace this stakind
> > with STATISTIC_KIND_LENGTH_HISTOGRAM? In the attached
> > patch STATISTIC_KIND_RANGE_EMPTY_FRAC is replaced
> > with STATISTIC_KIND_LENGTH_HISTOGRAM.
>
> Review comments:
>
> 1. In compute_range_stats, you need to guard against the case where
> there is no subdiff function. Perhaps default to 1.0 or something?
>

Let it be 1.0 without subdiff function. However, there is not so much use
of this method of estimation without subdiff. But, probably it's better
than nothing.

2. I think it would be helpful to add comments regarding what happens
> when lengths are identical, right now it's a little confusing. For
> instance, the comment: "Generate a length histogram slot entry if there
> are at least two length values" doesn't seem right, because the
> condition still matches even if there is only one distinct value.
>
I've rephrased comment. Not it's implicitly says that collected values are
not necessary distinct.

> 3. It looks like get_distance also needs to guard against a missing
> subdiff.
>

Same to compute_range_stats. Let default value be 1.0.

> 4. There are 3 binary search functions, which seems a little excessive:
> * rbound_bsearch: greatest i such that hist[i] < v; or -1
> * rbound_bsearch_equal: greatest i such that:
> hist[i] <= v and (i=0 or hist[i-1] != hist[i]); or -1
> * length_hist_bsearch: least i such that hist[i] >= v;
> or length of hist
> (let me know if I misunderstood the definitions)
> At a minimum, we need more consistent and informative names. Also, the
> definition of rbound_bsearch_equal is a little confusing because it's
> looking for the highest index among distinct values, but the lowest
> index among identical values. Do you see a way to refactor these to be a
> little easier to understand?
>

Actually, goal of rbound_bsearch_equal is to find histogram bin to start
interpolation from. I've renamed it to rbound_bsearch_bin and added
corresponding comment.

------
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
range_stat-0.8.patch.gz application/x-gzip 7.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2012-10-09 20:12:04 Re: Truncate if exists
Previous Message Robert Haas 2012-10-09 20:04:40 Re: Truncate if exists