Re: Improve selectivity estimate for range queries

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: hosoya(dot)yuzuko(at)lab(dot)ntt(dot)co(dot)jp
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Improve selectivity estimate for range queries
Date: 2018-12-21 03:24:36
Message-ID: 20181221.122436.238213221.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello.

At Thu, 20 Dec 2018 17:21:29 +0900, "Yuzuko Hosoya" <hosoya(dot)yuzuko(at)lab(dot)ntt(dot)co(dot)jp> wrote in <008701d4983d$02e731c0$08b59540$(at)lab(dot)ntt(dot)co(dot)jp>
> In my environment, the selectivity for id > 0 was 0.99990000000000001,
> and the selectivity for id < 10000 was 0.33333333333333331. Then, the
> value of rqlist->hibound and rqlist->lobound are set respectively.
> On the other hand, DEFAULT_INEQ_SEL is 0.3333333333333333. As a result,
> the final selectivity is computed according to DEFAULT_RANGE_INEQ_SEL,
> since the condition of the second if statement was satisfied. However,
> these selectivities were computed according to the statistics, so the
> final selectivity had to be calculated from rqlist->hibound + rqlist->lobound - 1.0.
> My test case might be uncommon, but I think it would cause some problems.

Yeah, its known behavior as the comment just above. If in your
example query, the table weer very large and had an index it
could run faultly an index scan over a fair amount of tuples. But
I'm not sure how much it matters and your example doesn't show
that.

The behvavior discussed here is introduced by this commit.

| commit 547bb4a7f2bdccad9253a99211ce84b3f9de485a
| Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
| Date: Tue Nov 9 00:34:46 2004 +0000
|
| Use a hopefully-more-reliable method of detecting default selectivity
| estimates when combining the estimates for a range query. As pointed out
| by Miquel van Smoorenburg, the existing check for an impossible combined
| result would quite possibly fail to detect one default and one non-default
| input. It seems better to use the default range query estimate in such
| cases. To do so, add a check for an estimate of exactly DEFAULT_INEQ_SEL.
| This is a bit ugly because it introduces additional coupling between
| clauselist_selectivity and scalarltsel/scalargtsel, but it's not like
| there wasn't plenty already...

> To handle such cases I've thought up of an idea based on a previous commit[1]
> which solved a similar problem related to DEFAULT_NUM_DISTINCT. Just like
> this modification, I added a flag which shows whether the selectivity

The commit [1] added almost no complexity, but this does. Affects
many functions to introduce tighter coupling between
operator-selectivity functions and clause selectivity functions.
isdefault travells alone too long just to bind remote
functions. We might need more pricipled way to handle that.

> was calculated according to the statistics or not to clauselist_selectivity,
> and used it as a condition of the if statement instead of
> rqlist->hibound == DEFAULT_INEQ_SEL and rqlist->lobound == DEFAULT_INEQ_SEL.
> I am afraid that changes were more than I had expected, but we can estimate
> selectivity correctly.
>
> Patch attached. Do you have any thoughts?

I've just run over the patch, but some have comments.

+ if (isdefault)
+ rqelem->lobound_isdefault = true;

This is taking the disjunction of lobounds ever added, I suppose
it should be the last lobounds' isdefault.

As you see, four other instances of "selec ==/!= DEFAULT_*" exist
in mergejoinsel. Don't they lead to similar kind of problem?

Selectivity lobound; /* Selectivity of a var > something clause */
Selectivity hibound; /* Selectivity of a var < something clause */
+ bool lobound_isdefault; /* lobound is a default selectivity? */
+ bool hibound_isdefault; /* hibound is a default selectivity? */
} RangeQueryClause;

Around the [1], there was a discussion about introducing the
notion of uncertainty to selectivity. The isdefualt is a kind of
uncertainty value indicating '0/100% uncertain'. So my feeling is
saying that it's better that Selectivity has certinty component
then building a selectivity arithmetics involving uncertainty,
though I don't have a clear picture.

> [1]
> https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=4c2777d0b733220d9029f78817af8ce

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2018-12-21 03:31:16 Re: Tid scan improvements
Previous Message Kyotaro HORIGUCHI 2018-12-21 03:20:27 Re: [HACKERS] Macros bundling RELKIND_* conditions