Re: Statistics and selectivity estimation for ranges

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Statistics and selectivity estimation for ranges
Date: 2012-09-30 23:22:01
Message-ID: 1349047321.15580.17.camel@jdavis
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, 2012-09-04 at 17:27 +0400, Alexander Korotkov wrote:
> Addon patch is attached. Actually, I don't get your intention of
> introducing STATISTIC_KIND_RANGE_EMPTY_FRAC stakind. Did you plan to
> leave it as empty frac in distinct stakind or replace this stakind
> with STATISTIC_KIND_LENGTH_HISTOGRAM? In the attached
> patch STATISTIC_KIND_RANGE_EMPTY_FRAC is replaced
> with STATISTIC_KIND_LENGTH_HISTOGRAM.

Review comments:

1. In compute_range_stats, you need to guard against the case where
there is no subdiff function. Perhaps default to 1.0 or something?

2. I think it would be helpful to add comments regarding what happens
when lengths are identical, right now it's a little confusing. For
instance, the comment: "Generate a length histogram slot entry if there
are at least two length values" doesn't seem right, because the
condition still matches even if there is only one distinct value.

3. It looks like get_distance also needs to guard against a missing
subdiff.

4. There are 3 binary search functions, which seems a little excessive:
* rbound_bsearch: greatest i such that hist[i] < v; or -1
* rbound_bsearch_equal: greatest i such that:
hist[i] <= v and (i=0 or hist[i-1] != hist[i]); or -1
* length_hist_bsearch: least i such that hist[i] >= v;
or length of hist
(let me know if I misunderstood the definitions)
At a minimum, we need more consistent and informative names. Also, the
definition of rbound_bsearch_equal is a little confusing because it's
looking for the highest index among distinct values, but the lowest
index among identical values. Do you see a way to refactor these to be a
little easier to understand?

Regards,
Jeff Davis

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Davis 2012-10-01 01:15:06 Re: gistchoose vs. bloat
Previous Message Andres Freund 2012-09-30 21:42:56 Re: embedded list v3