Re: WIP patch: distinguish selectivity of < from <= and > from >=

From: Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP patch: distinguish selectivity of < from <= and > from >=
Date: 2017-07-04 17:26:53
Message-ID: CAGz5QCKEhkH81xxdU5pJSRwTQGWvZOkj_rLGv_ZVV0uQeYm7ZQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jul 4, 2017 at 9:20 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com> writes:
>> On Tue, Jul 4, 2017 at 9:23 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> ... I have to admit that I've failed to wrap my brain around exactly
>>> why it's correct. The arguments that I've constructed so far seem to
>>> point in the direction of applying the opposite correction, which is
>>> demonstrably wrong. Perhaps someone whose college statistics class
>>> wasn't quite so long ago can explain this satisfactorily?
>
>> I guess that you're referring the last case, i.e.
>> explain analyze select * from tenk1 where thousand between 10 and 10;
>
> No, the thing that is bothering me is why it seems to be correct to
> apply a positive correction for ">=", a negative correction for "<",
> and no correction for "<=" or ">". That seems weird and I can't
> construct a plausible explanation for it. I think it might be a
> result of the fact that, given a discrete distribution rather than
> a continuous one, the histogram boundary values should be understood
> as having some "width" rather than being zero-width points on the
> distribution axis. But the arguments I tried to fashion on that
> basis led to other rules that didn't actually work.
>
> It's also possible that this logic is in fact wrong and it just happens
> to give the right answer anyway for uniformly-distributed cases.
>
So, here are two points I think:
1. When should we apply(add/subtract) the correction?
2. What should be the correction?

The first point:
there can be further two cases,
a) histfrac - actual_selectivity(p<=0) = 0.
For this case, I've an explanation why applying the correction in
above way works correctly. (This is the case with tenk1)
Since, histfrac correctly calculates selectivity for (p<=0),
hist_selec will either be <= or > (isgt is true). Hence, there is no
need to apply the correction. A negative correction is needed for less
than operator (sel(<=10) - sel(=10)). Similarly, a positive correction
is needed for greater than and equals to operator (sel(>10) +
sel(=10)).

b) histfrac - actual_selectivity(p<=0) != 0.
This is possible when we've high variance in the histogram buckets. I
guess here things may go wrong. Please consider the following example,
UPDATE tenk1 SET thousand=11 where thousand=10;
VACUUM ANALYZE tenk1;
explain analyze select * from tenk1 where thousand between 10 and 10;
Bitmap Heap Scan on tenk1 (cost=4.39..39.52 rows=10 width=244)
(actual time=0.018..0.018 rows=0 loops=1)

The second point:
In this case, the correction is calculated correctly as selectivity of
(p=0) because of uniform distribution. Hence, it works. When we don't
have uniform distribution, the current calculation of the correction
may prove to be over-estimation for most of the time.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kuntal Ghosh 2017-07-04 18:43:08 Re: WIP patch: distinguish selectivity of < from <= and > from >=
Previous Message Tom Lane 2017-07-04 17:24:08 Re: WIP patch: distinguish selectivity of < from <= and > from >=