Re: WIP patch: distinguish selectivity of < from <= and > from >=

From: Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP patch: distinguish selectivity of < from <= and > from >=
Date: 2017-07-09 12:22:24
Message-ID: CAGz5QCL3EZvU3QgMBRxtzO=EGGbZ_irdmsLW15+9Y7Z=Fxx35w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Jul 7, 2017 at 2:53 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com> writes:
Wow. Thank you for the wonderful explanation.

>> On Thu, Jul 6, 2017 at 3:45 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
>> if histogram_bounds are assigned as,
>> {0,9,19,29,39,49,59,69,79,89,99,109,119,129,13,..........}
>> ...
>> So, if val=low, then hisfrac = (bucket_current - 1)/num_of_buckets
>> which means it assumes val is included in the previous bucket.
>
> I looked at that again and realized that one of the answers I was missing
> lies in the behavior of analyze.c's compute_scalar_stats(). When it has
> collected "nvals" values and wants to make a histogram containing
> "num_hist" entries, it does this:
>
> * The object of this loop is to copy the first and last values[]
> * entries along with evenly-spaced values in between. So the
> * i'th value is values[(i * (nvals - 1)) / (num_hist - 1)].
>
> (where i runs from 0 to num_hist - 1). Because the "/" denotes integer
> division, this effectively means that for all but the first entry,
> we are taking the last value out of the corresponding bucket of samples.
> That's obviously true for the last histogram entry, which will be the very
> last sample value, and it's also true for earlier ones --- except for the
> *first* histogram entry, which is necessarily the first one in its bucket.
> So the first histogram bucket is actually a tad narrower than the others,
> which is visible in the typical data you showed above: 0 to 9 is only 9
> counts wide, but all the remaining buckets are 10 counts wide. That
> explains why we need a scale adjustment just in the first bucket.
>
Agreed. But, I've some doubt regarding the amount of adjustment needed.
+ if (i == 1)
+ histfrac += eq_selec * (1.0 - binfrac);
For predicate x<=p, when p lies in the first bucket, following is the amount
of binfrac that we're off from the actual one.
(Assume the same histogram boundaries i.e., 0,9,19,...)

For p=0, (1/10)-(0/9) = 1/10 * (1 - 0)
For p=1, (2/10)-(1/9) = 1/10 * (1 - 1/9)
For p=2, (3/10)-(2/9) = 1/10 * (1 - 2/9)
For p=3, (4/10)-(3/9) = 1/10 * (1 - 3/9)

In general, 1/(high + 1 - low) * (1.0 - binfrac) and the difference in
histfrac is
(1.0 - binfrac) / (high + 1 - low) / num_of_hist_buckets.

So, the above adjustment for the first bucket looks correct when,
otherdistinct ~ (high + 1 - low) * num_of_hist_buckets

Is that always true or I'm missing something?

> I think I'm also beginning to grasp why scalarineqsel's basic estimation
> process produces an estimate for "x <= p" rather than some other condition
> such as "x < p". For clarity, imagine that we're given p equal to the
> last histogram entry. If the test operator is in fact "<=", then it will
> say "true" for every histogram entry, and we'll fall off the end of the
> histogram and return 1.0, which is correct. If the test operator is "<",
> it will return "true" for all but the last entry, so that we end up with
> "i" equal to sslot.nvalues - 1 --- but we will compute binfrac = 1.0,
> if convert_to_scalar() produces sane answers, again resulting in
> histfrac = 1.0. Similar reasoning applies for ">" and ">=", so that in
> all cases we arrive at histfrac = 1.0 if p equals the last histogram
> entry. More generally, if p is equal to some interior histogram entry,
> say the k'th one out of n total, we end up with histfrac = (k-1)/(n-1)
> no matter which operator we probe with, assuming that convert_to_scalar()
> correctly gives us a binfrac of 0.0 or 1.0 depending on which of the
> adjacent bins we end up examining. So, remembering that the histogram
> entries occupy the right ends of their buckets, the proper interpretation
> of that is that it's the probability of "x <= p". (This is wrong for the
> first histogram entry, but that's why we need a correction there.)
>
True. In general, histfrac = (k-1)/(n-1) + binfrac where binfrac
depends on the linear interpolation.

For p=some histogram boundary, it'll always be the probability of
"x<=p". When p lies inside the bucket and we've a flat distribution,
then also it'll be the probability of "x<=p". But, when we've high
variance inside a bucket and p lies inside the bucket, linear
interpolation fails to capture the actual probability. But, as you've
mentioned earlier, I guess that's not a new problem.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2017-07-09 12:30:21 Re: pg_receivewal and messages printed in non-verbose mode
Previous Message Michael Paquier 2017-07-09 12:20:24 Re: pg_stop_backup(wait_for_archive := true) on standby server