Re: Improve selectivity estimate for range queries

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: hosoya(dot)yuzuko(at)lab(dot)ntt(dot)co(dot)jp, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Improve selectivity estimate for range queries
Date: 2019-01-08 07:26:38
Message-ID: 20190108.162638.106314087.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-hackers

Hello.

At Fri, 21 Dec 2018 11:50:28 -0500, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote in <28533(dot)1545411028(at)sss(dot)pgh(dot)pa(dot)us>
> "Yuzuko Hosoya" <hosoya(dot)yuzuko(at)lab(dot)ntt(dot)co(dot)jp> writes:
> > From: Kyotaro HORIGUCHI [mailto:horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp]
> >> At Thu, 20 Dec 2018 17:21:29 +0900, "Yuzuko Hosoya" <hosoya(dot)yuzuko(at)lab(dot)ntt(dot)co(dot)jp> wrote in
> >> <008701d4983d$02e731c0$08b59540$(at)lab(dot)ntt(dot)co(dot)jp>
> >>> To handle such cases I've come up with an idea based on a previous
> >>> commit[1] which solved a similar problem related to
> >>> DEFAULT_NUM_DISTINCT. Just like that modification, I added a flag
> >>> which shows whether the selectivity
>
> >> The commit [1] added almost no complexity, but this does. It affects
> >> many functions, introducing tighter coupling between
> >> operator-selectivity functions and clause-selectivity functions.
> >> isdefault travels alone too long just to bind remote functions. We
> >> might need a more principled way to handle that.
>
> Yeah, I don't think that this approach is a reasonable tradeoff to
> eliminate a narrow case. In particular, what's been done here to
> clause_selectivity is totally unprincipled and poorly thought out:
> you can't add an output parameter that's set in only some cases.
> Yet, there's no good way to decide how to set it in many cases.
> For instance, what would you do for an AND or OR where some of
> the branches are default estimates and some not?
>
> >> Around [1], there was a discussion about introducing the notion of
> >> uncertainty to selectivity. The isdefault is a kind of uncertainty
> >> value indicating '0/100% uncertain'. So my feeling is that it would
> >> be better for Selectivity to carry a certainty component, and then
> >> to build selectivity arithmetic involving uncertainty, though I
> >> don't have a clear picture.
>
> > I see. Indeed, if we changed Selectivity so that it included
> > information about default/non-default, the problems I mentioned
> > would be solved. But I'm afraid that such a modification would have
> > a big impact, since there are lots of functions that calculate
> > selectivity. I also don't have a concrete idea, so I will think
> > about it. Besides that, I'd like to ask the community whether such
> > a modification would be acceptable.
>
> The planner definitely has a lot of problems that could be addressed
> with some sort of uncertainty framework ... but it'd be a huge
> undertaking, which is why nobody's tried yet.

Sure.

> A smaller-footprint way to fix the immediate problem might be to
> change the values of DEFAULT_INEQ_SEL and friends so that they're
> even less likely to be matched by accident. That is, instead of
> 0.3333333333333333, use 0.333333333333342 or some such. It might

Sounds reasonable.

> seem that that's just moving the problem around, but I think it
> might be possible to show that such a value couldn't be computed
> by scalarltsel given a histogram with no more than 10000 members.
> (I haven't tried to actually prove that, but it seems intuitive
> that the set of possible results would be quantized with no more
> than about 5 digits precision.)

FWIW, I got the following result on my environment. The current
constant is reproduced exactly up to d = 1e30, while the proposed one
is not reproduced even at d = 1 (a "last matching d" of 0 means no d
matched at all). That looks distinct enough, provided it holds on all
supported platforms, though a longer sequence of arithmetic operations
could still produce a false match.

x = 0.33333333333333331
  last matching d:  1.000000e+30
  first mismatch d: 1.000000e+31
x = 0.33333333333334197
  last matching d:  0.000000e+00
  first mismatch d: 1.000000e+00

==== t.c
#include <stdio.h>
#include <math.h>

/*
 * Starting from d = 1 and multiplying by 10 each time (d standing in
 * for a histogram size), find the first d for which neither
 * d / floor(d * 3) nor d / (floor(d * 3) + 1) is exactly equal to x.
 */
static void test(double x)
{
    double d = 1.0;
    double d0 = 0;              /* last d for which a match was found */

    fprintf(stderr, "x = %19.17f\n", x);
    while (d != d0)
    {
        double z = floor(d * 3);
        double z1 = z + 1.0;
        double y = d / z;
        double y1 = d / z1;

        /* Stop once neither side of d * 3 reproduces x exactly */
        if (y != x && y1 != x)
        {
            fprintf(stderr, "  last matching d:  %e\n", d0);
            fprintf(stderr, "  first mismatch d: %e\n", d);
            return;
        }

        d0 = d;
        d = d * 10;
    }
}

int main(void)
{
    test(0.3333333333333333);   /* current DEFAULT_INEQ_SEL */
    test(0.333333333333342);    /* proposed replacement */

    return 0;
}
====
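
As a further cross-check of the quantization intuition (my own
addition, not part of the test above): the sketch below brute-forces
every ratio i/n with n up to 10000 and looks for an exact double match
against the proposed constant. It deliberately ignores the
within-bucket interpolation that the real scalarltsel computation
involves, so it only backs up the intuition rather than proving
anything; the file name q.c is just a placeholder.

==== q.c
#include <stdio.h>

int main(void)
{
    /* The proposed replacement constant, as the nearest double */
    const double target = 0.333333333333342;
    long hits = 0;

    /* Enumerate every ratio i/n with 1 <= n <= 10000 and 0 <= i <= n */
    for (int n = 1; n <= 10000; n++)
    {
        for (int i = 0; i <= n; i++)
        {
            if ((double) i / (double) n == target)
            {
                printf("exact match: %d / %d\n", i, n);
                hits++;
            }
        }
    }

    printf("%ld exact matches\n", hits);
    return 0;
}
====

If the quantization argument holds, this should report zero exact
matches: any fraction with denominator at most 10000 that comes
anywhere near the target is exactly 1/3 and so rounds to the double
for 1/3 itself, which the new constant was chosen to avoid.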

--
Kyotaro Horiguchi
NTT Open Source Software Center
