| From: | "Yuzuko Hosoya" <hosoya(dot)yuzuko(at)lab(dot)ntt(dot)co(dot)jp> | 
|---|---|
| To: | "'Kyotaro HORIGUCHI'" <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, <pgsql-hackers(at)lists(dot)postgresql(dot)org> | 
| Subject: | RE: Improve selectivity estimate for range queries | 
| Date: | 2018-12-21 08:21:43 | 
| Message-ID: | 009c01d49906$350ee890$9f2cb9b0$@lab.ntt.co.jp | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-hackers | 
Hi,
Thanks for the comments.
I attach the v2 patch.
> From: Kyotaro HORIGUCHI [mailto:horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp]
> Sent: Friday, December 21, 2018 12:25 PM
> 
> Hello.
> 
> At Thu, 20 Dec 2018 17:21:29 +0900, "Yuzuko Hosoya" <hosoya(dot)yuzuko(at)lab(dot)ntt(dot)co(dot)jp> wrote in
> <008701d4983d$02e731c0$08b59540$(at)lab(dot)ntt(dot)co(dot)jp>
> > In my environment, the selectivity for id > 0 was 0.99990000000000001,
> > and the selectivity for id < 10000 was 0.33333333333333331. Then, the
> > value of rqlist->hibound and rqlist->lobound are set respectively.
> > On the other hand, DEFAULT_INEQ_SEL is 0.3333333333333333.  As a
> > result, the final selectivity is computed according to
> > DEFAULT_RANGE_INEQ_SEL, since the condition of the second if statement
> > was satisfied. However, these selectivities were computed according to
> > the statistics, so the final selectivity had to be calculated from rqlist->hibound +
rqlist->lobound
> - 1.0.
> > My test case might be uncommon, but I think it would cause some problems.
> 
> Yeah, its known behavior as the comment just above. If in your example query, the table weer very
large
> and had an index it could run faultly an index scan over a fair amount of tuples. But I'm not sure
> how much it matters and your example doesn't show that.
> 
> The behvavior discussed here is introduced by this commit.
> 
> | commit 547bb4a7f2bdccad9253a99211ce84b3f9de485a
> | Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
> | Date:   Tue Nov 9 00:34:46 2004 +0000
> |
> |     Use a hopefully-more-reliable method of detecting default selectivity
> |     estimates when combining the estimates for a range query.  As pointed out
> |     by Miquel van Smoorenburg, the existing check for an impossible combined
> |     result would quite possibly fail to detect one default and one non-default
> |     input.  It seems better to use the default range query estimate in such
> |     cases.  To do so, add a check for an estimate of exactly DEFAULT_INEQ_SEL.
> |     This is a bit ugly because it introduces additional coupling between
> |     clauselist_selectivity and scalarltsel/scalargtsel, but it's not like
> |     there wasn't plenty already...
> 
> > To handle such cases I've thought up of an idea based on a previous
> > commit[1] which solved a similar problem related to
> > DEFAULT_NUM_DISTINCT.  Just like this modification, I added a flag
> > which shows whether the selectivity
> 
> The commit [1] added almost no complexity, but this does. Affects many functions to introduce
tighter
> coupling between operator-selectivity functions and clause selectivity functions.
> isdefault travells alone too long just to bind remote functions. We might need more pricipled way
to
> handle that.
>
Yes, indeed.  But I have not found better way than this approach yet.
 
> > was calculated according to the statistics or not to
> > clauselist_selectivity, and used it as a condition of the if statement
> > instead of
> > rqlist->hibound == DEFAULT_INEQ_SEL and rqlist->lobound == DEFAULT_INEQ_SEL.
> > I am afraid that changes were more than I had expected, but we can
> > estimate selectivity correctly.
> >
> > Patch attached.  Do you have any thoughts?
> 
> I've just run over the patch, but some have comments.
> 
> +			if (isdefault)
> +				rqelem->lobound_isdefault = true;
> 
> This is taking the disjunction of lobounds ever added, I suppose it should be the last lobounds'
> isdefault.
> 
Indeed.  I fixed it.
> As you see, four other instances of "selec ==/!= DEFAULT_*" exist in mergejoinsel. Don't they lead
> to similar kind of problem?
>
I didn't encounter similar problems, but as you said, such problems would be occurred due to the
condition, selec != DEFAULT_INEQ_SEL.  I changed these conditions into "!isdefault".
> 
> 
>      Selectivity lobound;        /* Selectivity of a var > something clause */
>      Selectivity hibound;        /* Selectivity of a var < something clause */
> +    bool        lobound_isdefault;    /* lobound is a default selectivity? */
> +    bool        hibound_isdefault;    /* hibound is a default selectivity? */
>  } RangeQueryClause;
> 
> Around the [1], there was a discussion about introducing the notion of uncertainty to selectivity.
> The isdefualt is a kind of uncertainty value indicating '0/100% uncertain'. So my feeling is
saying
> that it's better that Selectivity has certinty component then building a selectivity arithmetics
> involving uncertainty, though I don't have a clear picture.
>
I see.  Indeed if we would change Selectivity so that it includes 
information about default/non-default, such problems I mentioned 
would be solved.  But I'm afraid that this modification would lead
to a big impact, since there are lots of functions that calculate
selectivity.  I also don't have a concreate idea, so I will think 
about it.  Besides that, I'd like to ask community whether such
modification would be acceptable or not.
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center
> 
> > [1]
> > https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=4c2
> > 777d0b733220d9029f78817af8ce
> 
> regards.
> 
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
| Attachment | Content-Type | Size | 
|---|---|---|
| v2_improve_selectivity_estimate_for_range_queries.patch | application/octet-stream | 16.2 KB | 
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Nagaura, Ryohei | 2018-12-21 08:34:18 | RE: [suggestion]support UNICODE host variables in ECPG | 
| Previous Message | Edmund Horner | 2018-12-21 08:04:58 | Re: Tid scan improvements |