Re: Statistics and selectivity estimation for ranges

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Statistics and selectivity estimation for ranges
Date: 2012-08-07 04:25:45
Message-ID: CAPpHfds=0729kHnS0EGT3ydte1uBOxczCCVys--nfyqaH+DALA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Aug 6, 2012 at 6:09 PM, Heikki Linnakangas <
heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

> On 04.08.2012 12:31, Alexander Korotkov wrote:
>
>> Hackers,
>>
>> attached patch is for collecting statistics and selectivity estimation for
>> ranges.
>>
>> In order to make our estimations accurate for every distribution of
>> ranges, we would collect 2d-distribution of lower and upper bounds of
>> range
>> into some kind of 2d-histogram. However, this patch use some
>> simplification
>> and assume distribution of lower bound and distribution of length to be
>> independent.
>>
>
> Sounds reasonable. Another possibility would be to calculate the average
> length for each lower-bound bin. So you would e.g know the average length
> of values with lower bound between 1-10, and the average length of values
> with lower bound between 10-20, and so forth. Within a bin, you would have
> to assume that the distribution of the lengths is fixed.
>

Interesting idea. AFAICS, if we store average length for each lower-bound
bin, we still have to assume some kind of distribution of range length in
order to do estimates. For example, assume that range length have
exponential distribution. Correspondingly, we've following trade off: we
don't have to assume lower bound distribution to be independent from length
distribution, but we have to assume kind of length distribution. Actually,
I don't know what is better.
Ideally, we would have range length histogram for each lower-bound bin, or
upper-bound histogram for each lower-bound bin. But, storing such amount of
data seems too expensive.

------
With best regards,
Alexander Korotkov.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Etsuro Fujita 2012-08-07 06:02:22 Re: WIP Patch: Use sortedness of CSV foreign tables for query planning
Previous Message Craig Ringer 2012-08-07 03:22:53 Beta 3