RE: Improve selectivity estimate for range queries

From: "Yuzuko Hosoya" <hosoya(dot)yuzuko(at)lab(dot)ntt(dot)co(dot)jp>
To: "'Kyotaro HORIGUCHI'" <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RE: Improve selectivity estimate for range queries
Date: 2018-12-21 08:21:43
Message-ID: 009c01d49906$350ee890$9f2cb9b0$@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Thanks for the comments.
I attach the v2 patch.

> From: Kyotaro HORIGUCHI [mailto:horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp]
> Sent: Friday, December 21, 2018 12:25 PM
>
> Hello.
>
> At Thu, 20 Dec 2018 17:21:29 +0900, "Yuzuko Hosoya" <hosoya(dot)yuzuko(at)lab(dot)ntt(dot)co(dot)jp> wrote in
> <008701d4983d$02e731c0$08b59540$(at)lab(dot)ntt(dot)co(dot)jp>
> > In my environment, the selectivity for id > 0 was 0.99990000000000001,
> > and the selectivity for id < 10000 was 0.33333333333333331. Then, the
> > value of rqlist->hibound and rqlist->lobound are set respectively.
> > On the other hand, DEFAULT_INEQ_SEL is 0.3333333333333333. As a
> > result, the final selectivity is computed according to
> > DEFAULT_RANGE_INEQ_SEL, since the condition of the second if statement
> > was satisfied. However, these selectivities were computed according to
> > the statistics, so the final selectivity had to be calculated from rqlist->hibound +
rqlist->lobound
> - 1.0.
> > My test case might be uncommon, but I think it would cause some problems.
>
> Yeah, its known behavior as the comment just above. If in your example query, the table weer very
large
> and had an index it could run faultly an index scan over a fair amount of tuples. But I'm not sure
> how much it matters and your example doesn't show that.
>
> The behvavior discussed here is introduced by this commit.
>
> | commit 547bb4a7f2bdccad9253a99211ce84b3f9de485a
> | Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
> | Date: Tue Nov 9 00:34:46 2004 +0000
> |
> | Use a hopefully-more-reliable method of detecting default selectivity
> | estimates when combining the estimates for a range query. As pointed out
> | by Miquel van Smoorenburg, the existing check for an impossible combined
> | result would quite possibly fail to detect one default and one non-default
> | input. It seems better to use the default range query estimate in such
> | cases. To do so, add a check for an estimate of exactly DEFAULT_INEQ_SEL.
> | This is a bit ugly because it introduces additional coupling between
> | clauselist_selectivity and scalarltsel/scalargtsel, but it's not like
> | there wasn't plenty already...
>
> > To handle such cases I've thought up of an idea based on a previous
> > commit[1] which solved a similar problem related to
> > DEFAULT_NUM_DISTINCT. Just like this modification, I added a flag
> > which shows whether the selectivity
>
> The commit [1] added almost no complexity, but this does. Affects many functions to introduce
tighter
> coupling between operator-selectivity functions and clause selectivity functions.
> isdefault travells alone too long just to bind remote functions. We might need more pricipled way
to
> handle that.
>
Yes, indeed. But I have not found better way than this approach yet.

> > was calculated according to the statistics or not to
> > clauselist_selectivity, and used it as a condition of the if statement
> > instead of
> > rqlist->hibound == DEFAULT_INEQ_SEL and rqlist->lobound == DEFAULT_INEQ_SEL.
> > I am afraid that changes were more than I had expected, but we can
> > estimate selectivity correctly.
> >
> > Patch attached. Do you have any thoughts?
>
> I've just run over the patch, but some have comments.
>
> + if (isdefault)
> + rqelem->lobound_isdefault = true;
>
> This is taking the disjunction of lobounds ever added, I suppose it should be the last lobounds'
> isdefault.
>
Indeed. I fixed it.

> As you see, four other instances of "selec ==/!= DEFAULT_*" exist in mergejoinsel. Don't they lead
> to similar kind of problem?
>
I didn't encounter similar problems, but as you said, such problems would be occurred due to the
condition, selec != DEFAULT_INEQ_SEL. I changed these conditions into "!isdefault".

>
>
> Selectivity lobound; /* Selectivity of a var > something clause */
> Selectivity hibound; /* Selectivity of a var < something clause */
> + bool lobound_isdefault; /* lobound is a default selectivity? */
> + bool hibound_isdefault; /* hibound is a default selectivity? */
> } RangeQueryClause;
>
> Around the [1], there was a discussion about introducing the notion of uncertainty to selectivity.
> The isdefualt is a kind of uncertainty value indicating '0/100% uncertain'. So my feeling is
saying
> that it's better that Selectivity has certinty component then building a selectivity arithmetics
> involving uncertainty, though I don't have a clear picture.
>
I see. Indeed if we would change Selectivity so that it includes
information about default/non-default, such problems I mentioned
would be solved. But I'm afraid that this modification would lead
to a big impact, since there are lots of functions that calculate
selectivity. I also don't have a concreate idea, so I will think
about it. Besides that, I'd like to ask community whether such
modification would be acceptable or not.

Best regards,

Yuzuko Hosoya
NTT Open Source Software Center

>
> > [1]
> > https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=4c2
> > 777d0b733220d9029f78817af8ce
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center

Attachment Content-Type Size
v2_improve_selectivity_estimate_for_range_queries.patch application/octet-stream 16.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nagaura, Ryohei 2018-12-21 08:34:18 RE: [suggestion]support UNICODE host variables in ECPG
Previous Message Edmund Horner 2018-12-21 08:04:58 Re: Tid scan improvements