Re: Statistics and selectivity estimation for ranges

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Statistics and selectivity estimation for ranges
Date: 2013-02-13 13:55:25
Message-ID: CAPpHfdskfvxxJ7b=GFz1WQDHQnvNMEegVKm=SnqNQX9T3grnOg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Feb 13, 2013 at 5:28 PM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com
> wrote:

> On 04.01.2013 10:42, Alexander Korotkov wrote:
>
>> /*
>> * Calculate selectivity of "&&" operator using histograms of range lower
>> bounds
>> * and histogram of range lengths.
>> */
>> static double
>> calc_hist_selectivity_overlap(**TypeCacheEntry *typcache, RangeBound
>> *lower,
>> RangeBound *upper, RangeBound
>> *hist_lower, int hist_nvalues,
>> Datum
>> *length_hist_values, int length_hist_nvalues)
>>
>
> We already have code to estimate &&, based on the lower and upper bound
> histograms:
>
> case OID_RANGE_OVERLAP_OP:
>> case OID_RANGE_CONTAINS_ELEM_OP:
>> /*
>> * A && B <=> NOT (A << B OR A >> B).
>> *
>> * "range @> elem" is equivalent to "range &&
>> [elem,elem]". The
>> * caller already constructed the singular range
>> from the element
>> * constant, so just treat it the same as &&.
>> */
>> hist_selec =
>> calc_hist_selectivity_scalar(**typcache,
>> &const_lower, hist_upper,
>>
>> nhist, false);
>> hist_selec +=
>> (1.0 - calc_hist_selectivity_scalar(**typcache,
>> &const_upper, hist_lower,
>>
>> nhist, true));
>> hist_selec = 1.0 - hist_selec;
>> break;
>>
>
> I don't think the method based on lower bound and length histograms is any
> better. In fact, my gut feeling is that it's less accurate. I'd suggest
> dropping that part of the patch.
>

Right. This estimation has an accuracy of histogram, while estimation based
on lower bound and length histograms rely on additional assumption about
independence of lower bound and length histogram. We can sum A << B and A
>> B probabilities because they are mutually exclusive. It's pretty evident
but I would like to mention it in the comments, because typical assumption
about events in statistics calculation is their independence.

------
With best regards,
Alexander Korotkov.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2013-02-13 14:01:00 Re: Fractal tree indexing
Previous Message Atri Sharma 2013-02-13 13:54:28 Re: Fractal tree indexing