Re: Statistics and selectivity estimation for ranges

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Statistics and selectivity estimation for ranges
Date: 2012-08-15 07:38:16
Message-ID: CAPpHfdsKfL_ZAxp4H-6Z3MUy2engekFB9DE7_Yrob9X8rXzxKA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Aug 14, 2012 at 7:46 PM, Heikki Linnakangas <
heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

> On 14.08.2012 09:45, Alexander Korotkov wrote:
>
>> After fixing a few more bugs, I have a version with much more reasonable
>> accuracy.
>>
>
> Great! One little thing just occurred to me:
>
> You're relying on the regular scalar selectivity estimators for the <<,
> >>, &< and &> operators. That seems bogus, in particular for << and &<,
> because ineq_histogram_selectivity then performs a binary search of the
> histogram using those operators. << and &< compare the *upper* bound of
> the value in the table against the lower bound of the constant, but the
> histogram is constructed using the regular < operator, which sorts the
> entries by lower bound. I think the estimates you now get for those
> operators are quite bogus if there is a non-trivial amount of overlap
> between ranges. For example:
>
> postgres=# create table range_test as
>   select int4range(-a, a) as r from generate_series(1,1000000) a; analyze range_test;
> SELECT 1000000
> ANALYZE
> postgres=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r << int4range(200000, 200001);
>                                                     QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------
>  Seq Scan on range_test  (cost=0.00..17906.00 rows=100 width=14) (actual time=0.060..1340.147 rows=200000 loops=1)
>    Filter: (r << '[200000,200001)'::int4range)
>    Rows Removed by Filter: 800000
>  Total runtime: 1371.865 ms
> (4 rows)
>
> It would be quite easy to provide reasonable estimates for those
> operators, if we had a separate histogram of upper bounds. I also note that
> the estimation of overlap selectivity could be implemented using separate
> histograms of lower bounds and upper bounds, without requiring a histogram
> of range lengths, because a && b == NOT (a << b OR a >> b). I'm not sure if
> the estimates we'd get that way would be better or worse than your current
> method, but I think it would be easier to understand.
>
> I don't think the &< and &> operators could be implemented in terms of a
> lower and upper bound histogram, though, so you'd still need the current
> length histogram method for that.
>

Oh, actually I didn't touch those operators. Selectivity estimation
functions for them were already in the catalog; they just didn't work
previously because there were no statistics. A histogram of upper bounds
would be both more accurate and more natural for some operators. However,
it requires collecting additional statistics, while AFAICS it doesn't free
us from also needing the histogram of range lengths.
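
Just to make the "a && b == NOT (a << b OR a >> b)" decomposition concrete,
here is a rough standalone sketch (my illustration, not patch code) of how
"col && const" could be estimated from separate lower- and upper-bound
histograms. Bounds are simplified to plain doubles, bound inclusivity and
empty ranges are ignored, and hist_fraction_lt and the toy histograms are
made up for the example:

#include <stdio.h>

/*
 * Fraction of histogram values strictly below "bound", with linear
 * interpolation inside the containing bin.  "hist" holds nvalues
 * equi-height bin boundaries in ascending order.
 */
static double
hist_fraction_lt(const double *hist, int nvalues, double bound)
{
    int     i;

    if (bound <= hist[0])
        return 0.0;
    if (bound >= hist[nvalues - 1])
        return 1.0;

    for (i = 0; i < nvalues - 1; i++)
    {
        if (bound < hist[i + 1])
        {
            double  binfrac = (bound - hist[i]) / (hist[i + 1] - hist[i]);

            return ((double) i + binfrac) / (double) (nvalues - 1);
        }
    }
    return 1.0;                 /* not reached */
}

int
main(void)
{
    /* toy equi-height histograms of lower and upper bounds */
    double  lower_hist[] = {-1000000, -750000, -500000, -250000, -1};
    double  upper_hist[] = {1, 250000, 500000, 750000, 1000000};
    int     nvalues = 5;

    /* constant range [200000,200001) */
    double  const_lower = 200000;
    double  const_upper = 200001;

    /* a << const: the row's upper bound falls below the constant's lower bound */
    double  sel_left = hist_fraction_lt(upper_hist, nvalues, const_lower);

    /* a >> const: the row's lower bound falls above the constant's upper bound */
    double  sel_right = 1.0 - hist_fraction_lt(lower_hist, nvalues, const_upper);

    printf("<<: %.3f  >>: %.3f  &&: %.3f\n",
           sel_left, sel_right, 1.0 - sel_left - sel_right);
    return 0;
}

With toy histograms matching the example table above, this prints roughly
0.2 for <<, 0.0 for >> and 0.8 for &&, which lines up with the actual row
counts in Heikki's EXPLAIN output.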

> The code that traverses the lower bound and length histograms in
> lockstep looks quite scary. Any ideas on how to simplify that? My first
> thought is that there should be a helper function that gets a range length
> as argument, and returns the fraction of tuples with length >= that
> argument. It would do the lookup in the length histogram to find the right
> histogram bin, and do the linear interpolation within the bin. You're
> assuming that length is independent of lower/upper bound, so you shouldn't
> need any parameters other than range length for that estimation.
>
> You could then loop through only the lower bounds, and call the helper
> function for each bin to get the fraction of ranges long enough in that
> bin, instead of dealing with both histograms in the same loop. I think a
> helper function like that might simplify those scary loops significantly,
> but I wasn't sure if there's some more intelligence in the way you combine
> values from the length and lower bound histograms that couldn't be done
> with such a helper function.

Yes, I also thought about something like this. But, in order to keep the
current estimate accuracy, it has to be a bit more complicated, for the
following reasons:
1) In the last version, I don't estimate just the fraction of tuples with
length >= argument, but the area under the length histogram between two
length bounds (length_hist_summ).
2) If the histogram ends before the given length bound is reached, we also
need to return the place where that happened. Currently this is done by
hist_frac *= (length - prev_dist) / (dist - prev_dist).
I'm going to try some simplification that takes care of both of these
aspects; a rough sketch of the simple single-bound helper is below.
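
To illustrate the basic shape of the helper Heikki describes, here is a
simplified standalone sketch (my illustration, not the actual patch code)
of the single-bound case: the fraction of rows whose length is >= a given
bound, found by locating the containing bin of the length histogram and
interpolating linearly within it. The real code would additionally have to
integrate this between two length bounds (length_hist_summ) and report
where the histogram runs out, per points 1 and 2 above; frac_length_ge and
the toy histogram are made up for the example:

#include <stdio.h>

/*
 * Fraction of rows whose range length is >= "len", estimated from an
 * equi-height histogram of lengths ("length_hist" holds nvalues bin
 * boundaries in ascending order), with linear interpolation inside the
 * bin that contains "len".
 */
static double
frac_length_ge(const double *length_hist, int nvalues, double len)
{
    int     i;

    if (len <= length_hist[0])
        return 1.0;
    if (len >= length_hist[nvalues - 1])
        return 0.0;

    for (i = 0; i < nvalues - 1; i++)
    {
        if (len < length_hist[i + 1])
        {
            /* how far into bin i the requested length falls, in [0, 1) */
            double  binfrac = (len - length_hist[i]) /
                              (length_hist[i + 1] - length_hist[i]);
            /* fraction of rows with length < len */
            double  frac_lt = ((double) i + binfrac) / (double) (nvalues - 1);

            return 1.0 - frac_lt;
        }
    }
    return 0.0;                 /* not reached */
}

int
main(void)
{
    /* toy equi-height histogram of range lengths */
    double  length_hist[] = {0, 250000, 500000, 750000, 1000000};

    printf("frac(len >= 600000) = %.3f\n",
           frac_length_ge(length_hist, 5, 600000));
    return 0;
}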

------
With best regards,
Alexander Korotkov.
