Re: Statistics and selectivity estimation for ranges

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Statistics and selectivity estimation for ranges
Date: 2012-08-14 15:46:18
Message-ID: 502A72CA.80704@enterprisedb.com
Lists: pgsql-hackers

On 14.08.2012 09:45, Alexander Korotkov wrote:
> After fixing a few more bugs, I have a version with much more
> reasonable accuracy.

Great! One little thing just occurred to me:

You're relying on the regular scalar selectivity estimators for the <<,
>>, &< and &> operators. That seems bogus, in particular for << and &<,
because ineq_histogram_selectivity then performs a binary search of the
histogram using those operators. << compares the *upper* bound of the
value in the table against the lower bound of the constant (and &<
against its upper bound), but the histogram is constructed using the
regular < operator, which sorts the entries by lower bound. I think the
estimates you now get for those operators are quite bogus if there is a
non-trivial amount of overlap between ranges. For example:

postgres=# create table range_test as
  select int4range(-a, a) as r from generate_series(1,1000000) a;
SELECT 1000000
postgres=# analyze range_test;
ANALYZE
postgres=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r <<
int4range(200000, 200001);
                                             QUERY PLAN
---------------------------------------------------------------------------------------------------
 Seq Scan on range_test  (cost=0.00..17906.00 rows=100 width=14) (actual time=0.060..1340.147 rows=200000 loops=1)
   Filter: (r << '[200000,200001)'::int4range)
   Rows Removed by Filter: 800000
 Total runtime: 1371.865 ms
(4 rows)

It would be quite easy to provide reasonable estimates for those
operators, if we had a separate histogram of upper bounds. I also note
that the estimation of overlap selectivity could be implemented using
separate histograms of lower bounds and upper bounds, without requiring
a histogram of range lengths, because a && b == NOT (a << b OR a >> b).
I'm not sure if the estimates we'd get that way would be better or worse
than your current method, but I think it would be easier to understand.
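Just to illustrate that identity with a rough Python sketch (plain
sorted lists standing in for the equi-depth bound histograms; all names
invented, not anything in the patch):

```python
import bisect

def frac_below(hist, x):
    """Fraction of sampled values < x, given the sorted bin boundaries of
    an equi-depth histogram, with linear interpolation inside the bin."""
    if x <= hist[0]:
        return 0.0
    if x >= hist[-1]:
        return 1.0
    i = bisect.bisect_right(hist, x) - 1            # bin containing x
    within = (x - hist[i]) / (hist[i + 1] - hist[i])
    return (i + within) / (len(hist) - 1)

def overlap_selectivity(lower_hist, upper_hist, const_lower, const_upper):
    """P(a && const) = 1 - P(a << const) - P(a >> const).

    a << const  iff  upper(a) <  const_lower  -> upper-bound histogram
    a >> const  iff  lower(a) >= const_upper  -> lower-bound histogram
    """
    p_strictly_left  = frac_below(upper_hist, const_lower)
    p_strictly_right = 1.0 - frac_below(lower_hist, const_upper)
    return max(0.0, 1.0 - p_strictly_left - p_strictly_right)
```

So the && estimate falls out of the same two lookups that << and >>
would need anyway, with no length histogram involved.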

I don't think the &< and &> operators could be implemented in terms of a
lower and upper bound histogram, though, so you'd still need the current
length histogram method for that.

The code that traverses the lower bound and length histograms in
lockstep looks quite scary. Any ideas on how to simplify it? My first
thought is that there should be a helper function that takes a range
length as argument, and returns the fraction of tuples with length >=
that argument. It would look up the right bin in the length histogram,
and do the linear interpolation within the bin. You're assuming that
length is independent of the lower/upper bound, so you shouldn't need
any parameter other than the range length for that estimation.
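In Python-ish pseudocode, I'm imagining something like this (just a
sketch with invented names, operating on a plain sorted list of bin
boundaries rather than the real length histogram):

```python
import bisect

def frac_length_ge(length_hist, min_len):
    """Fraction of sampled ranges with length >= min_len, given the sorted
    bin boundaries of an equi-depth histogram of range lengths, with
    linear interpolation inside the bin containing min_len."""
    if min_len <= length_hist[0]:
        return 1.0
    if min_len >= length_hist[-1]:
        return 0.0
    i = bisect.bisect_right(length_hist, min_len) - 1
    within = (min_len - length_hist[i]) / (length_hist[i + 1] - length_hist[i])
    shorter = (i + within) / (len(length_hist) - 1)  # fraction with length < min_len
    return 1.0 - shorter
```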

You could then loop through only the lower bounds, and call the helper
function for each bin to get the fraction of ranges long enough in that
bin, instead of dealing with both histograms in the same loop. I think a
helper function like that might simplify those scary loops
significantly, but I wasn't sure whether there's some more intelligence
in the way you combine values from the length and lower bound histograms
that couldn't be expressed with such a helper function.
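The shape of the loop I have in mind, as another rough Python sketch
(invented names again; `frac_len_ge` stands for any helper of the kind
described above, passed in as a callable, and the length-independence
assumption is baked in):

```python
def overlap_via_lengths(lower_hist, const_lower, const_upper, frac_len_ge):
    """Estimate P(a && const) using only the lower-bound histogram plus a
    length-fraction helper.  a && const iff lower(a) < const_upper AND
    upper(a) > const_lower; since upper(a) = lower(a) + length(a), the
    second condition is length(a) > const_lower - lower(a)."""
    nbins = len(lower_hist) - 1
    total = 0.0
    for i in range(nbins):
        lo, hi = lower_hist[i], lower_hist[i + 1]
        if lo >= const_upper:
            break                      # lower bounds past the constant: no overlap
        # portion of this equi-depth bin with lower(a) < const_upper
        covered = 1.0 if hi <= const_upper else (const_upper - lo) / (hi - lo)
        mid = (lo + min(hi, const_upper)) / 2.0      # representative lower bound
        total += covered * frac_len_ge(const_lower - mid) / nbins
    return total
```

Each iteration touches only one bin of one histogram and makes a single
helper call, which is the simplification I was hoping for.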

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
