Re: Improve selectivity estimate for range queries

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: hosoya(dot)yuzuko(at)lab(dot)ntt(dot)co(dot)jp
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, robertmhaas(at)gmail(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Improve selectivity estimate for range queries
Date: 2019-01-24 12:50:56
Message-ID: 20190124.215056.255027863.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi.

At Fri, 11 Jan 2019 11:36:47 +0900, "Yuzuko Hosoya" <hosoya(dot)yuzuko(at)lab(dot)ntt(dot)co(dot)jp> wrote in <000001d4a956$806a2ab0$813e8010$(at)lab(dot)ntt(dot)co(dot)jp>
> Hi,
>
> Thanks for the comments, and I'm sorry for the late reply.
>
> > From: Tom Lane [mailto:tgl(at)sss(dot)pgh(dot)pa(dot)us]
> > Sent: Friday, January 11, 2019 7:04 AM
> > > Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > > On Fri, Dec 21, 2018 at 11:50 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > >> A smaller-footprint way to fix the immediate problem might be to
> > >> change the values of DEFAULT_INEQ_SEL and friends so that they're
> > >> even less likely to be matched by accident. That is, instead of
> > >> 0.3333333333333333, use 0.333333333333342 or some such.
> >
> > > That's not a dumb idea, but it seems pretty unprincipled to me, and to
> > > be honest I'm kind of surprised that you're not proposing something
> > > cleaner.
> >
> > The problem is the invasiveness of such a change (large) vs the benefit (not so large). The
> upthread
> > patch attempted to add a separate signaling path, but it was very incomplete --- and yet both I
> and
> > Horiguchi-san thought it was already too messy.
> >
> > Maybe at some point we'll go over to something reasonably principled, like adding confidence
> intervals
> > to all selectivity estimates. That would be *really* invasive but perhaps would bring enough
> benefit
> > to justify itself. But the current patch is just attempting to fix one extremely narrow pain
> point,
> > and if that is all it's doing it should have a commensurately small footprint. So I don't think
> the
> > submitted patch looks good from a cost/benefit standpoint.
> >
> Yes, I agree with you. Indeed the patch I attached is insufficient in cost-effectiveness.
> However, I want to solve problems of that real estimates happened to equal to the default
> values such as this case, even though it's a narrow pain point. So I tried distinguishing
> explicitly between real estimates and otherwise as Robert said.
>
> The idea Tom proposed and Horiguchi-san tried seems reasonable, but I'm concerned whether
> any range queries really cannot match 0.333333333333342 (or some such) by accident in any
> environments. Is the way which Horiguchi-san did enough to prove that?

It cannot be perfect in priciple, but I think it is reliable
enough. This is not principled at all but very effective
comparatively the complexity, I think.

Actually, i/(i*3+1) for some 10^n's are:

1/ 4:binary format: 3f d0 00 00 00 00 00 00
10/ 31:binary format: 3f d4 a5 29 4a 52 94 a5
100/ 301:binary format: 3f d5 43 30 7a 78 c5 51
1000/ 3001:binary format: 3f d5 53 83 74 70 f1 95
10000/ 30001:binary format: 3f d5 55 26 bb 44 2b a8
100000/ 300001:binary format: 3f d5 55 50 ac 4a 74 6d
1000000/ 3000001:binary format: 3f d5 55 54 de 07 5a 96
10000000/ 30000001:binary format: 3f d5 55 55 49 67 22 6d
100000000/ 300000001:binary format: 3f d5 55 55 54 23 e9 d7
1000000000/ 3000000001:binary format: 3f d5 55 55 55 36 ca 95
10000000000/ 30000000001:binary format: 3f d5 55 55 55 52 47 75
100000000000/ 300000000001:binary format: 3f d5 55 55 55 55 07 25
1000000000000/ 3000000000001:binary format: 3f d5 55 55 55 55 4d 84
10000000000000/ 30000000000001:binary format: 3f d5 55 55 55 55 54 8d
100000000000000/ 300000000000001:binary format: 3f d5 55 55 55 55 55 41
1000000000000000/ 3000000000000001:binary format: 3f d5 55 55 55 55 55 53

So, (as Tom said upthread) intuitively we can expect that we are
safe up to roughly 10^10? for the simple cases.

# Of course involving histogram makes things complex, though.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2019-01-24 12:57:26 Re: using expression syntax for partition bounds
Previous Message Amit Langote 2019-01-24 12:43:20 Re: problems with foreign keys on partitioned tables