Re: Making clausesel.c Smarter

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Making clausesel.c Smarter
Date: 2017-04-04 11:21:04
Message-ID: CAKJS1f96Lmsvgfeh8zP-9mZ6=+=U9LGXPKFzjCwpF1agip7s6w@mail.gmail.com
Lists: pgsql-hackers

On 4 April 2017 at 13:35, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
> On Mon, Apr 3, 2017 at 9:19 PM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
>> On Mon, Apr 3, 2017 at 8:52 PM, David Rowley
>> <david(dot)rowley(at)2ndquadrant(dot)com> wrote:
>>>> One last observation:
>>>>
>>>> + /*
>>>> + * An IS NOT NULL test is a no-op if there's any other strict quals,
>>>> + * so if that's the case, then we'll only apply this, otherwise we'll
>>>> + * ignore it.
>>>> + */
>>>> + else if (cslist->selmask == CACHESEL_NOTNULLTEST)
>>>> + s1 *= cslist->notnullsel;
>>>>
>>>> In some paths, the range selectivity estimation code punts and returns
>>>> DEFAULT_RANGE_INEQ_SEL. In those cases, if a not null test was
>>>> present, it should still be applied. It could make a big difference if
>>>> the selectivity for the nulltest is high.
>>>
>>> I'd say that the location that applies DEFAULT_RANGE_INEQ_SEL
>>> should apply the nullfrac. It seems wrong to rely on an IS NOT NULL
>>> test existing to estimate that properly. I don't think that needs to
>>> be done as part of this patch. I'd rather limit the scope, since
>>> "returned with feedback" is already knocking at the door of this patch.
>>
>> I agree, but this patch regresses for those cases if the user
>> compensated with a not null test, leaving little recourse, so I'd fix
>> it in this patch to avoid the regression.
>>
>> Maybe you're right that the null fraction should be applied regardless
>> of whether there's an explicit null check, though.
>
> The more I read that code, the more I'm convinced there's no way to
> get DEFAULT_RANGE_INEQ_SEL when (rqlist->hibound != DEFAULT_INEQ_SEL &&
> rqlist->lobound != DEFAULT_INEQ_SEL) other than from contradictory
> clauses, a case the current code handles very poorly: it returns
> DEFAULT_RANGE_INEQ_SEL and ignores nullfrac, which can give way-off
> estimates if the table has lots of nulls.
>
> Say, consider if "value" was mostly null and you write:
>
> SELECT * FROM testdata WHERE value BETWEEN 10 AND 20 AND value BETWEEN
> 50 AND 60;
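
To put a rough number on "way-off", here is a quick back-of-the-envelope
sketch using made-up figures (a 99% NULL column and a 100k-row table,
nothing measured) of what ignoring nullfrac does in that scenario:

#include <stdio.h>

/*
 * Back-of-the-envelope only: the nullfrac is invented, not taken from any
 * real table.  DEFAULT_RANGE_INEQ_SEL is 0.005 in selfuncs.h.
 */
int
main(void)
{
    double  nullfrac = 0.99;             /* assume "value" is 99% NULL */
    double  notnullsel = 1.0 - nullfrac;
    double  ntuples = 100000;
    double  bare = 0.005;                /* DEFAULT_RANGE_INEQ_SEL, nullfrac ignored */
    double  scaled = 0.005 * notnullsel; /* same default, discounted for NULLs */

    printf("ignoring nullfrac: %.0f rows\n", bare * ntuples);   /* 500 */
    printf("applying nullfrac: %.1f rows\n", scaled * ntuples); /* 5 */
    return 0;
}
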
>
> All other cases I can think of would end with either hibound or
> lobound being equal to DEFAULT_INEQ_SEL.
>
> It seems to me, doing a git blame, that the checks in the else branch
> were left only as a precaution, one that seems unnecessary.
>
> If you ask me, I'd just leave:
>
> + if (rqlist->hibound == DEFAULT_INEQ_SEL || rqlist->lobound == DEFAULT_INEQ_SEL)
> + {
> +     /* No point in checking null selectivity, DEFAULT_INEQ_SEL implies we have no stats */
> +     s2 = DEFAULT_RANGE_INEQ_SEL;
> + }
> + else
> + {
> ...
> +     s2 = rqlist->hibound + rqlist->lobound - 1.0;
> +     nullsel = nulltestsel(root, IS_NULL, rqlist->var, varRelid);
> +     notnullsel = 1.0 - nullsel;
> +
> +     /* Adjust for double-exclusion of NULLs */
> +     s2 += nullsel;
> +     if (s2 <= 0.0)
> +     {
> +         if (s2 <= (-1.0e-4 * notnullsel - 1.0e-6))
> +         {
> +             /* Most likely contradictory clauses found */
> +             s2 = (1.0e-10 < notnullsel) ? 1.0e-10 : notnullsel;
> +         }
> +         else
> +         {
> +             /* Could be a rounding error */
> +             s2 = DEFAULT_RANGE_INEQ_SEL * notnullsel;
> +         }
> +     }
> + }
>
> Here (-1.0e-4 * notnullsel - 1.0e-6) is just a very rough (and overly
> cautious) estimate of the amount of rounding error that could be
> present with 32-bit floats.
>
> The above assumes that a non-DEFAULT_INEQ_SEL value in lobound/hibound
> is an estimate based on stats, maybe approximate, but one that makes
> sense (i.e. preserves the monotonicity of the CPF), and as such a
> negative value is a sign of either a contradiction or rounding error.
> The previous code allowed for negatives coming out of default
> selectivities creeping into non-DEFAULT_INEQ_SEL estimates, but that
> doesn't seem possible coming out of scalarineqsel.
>
> But I'd like it if Tom could comment on that, since he's the one that
> did that in commit 547bb4a7f2bdccad9253a99211ce84b3f9de485a (way back
> in 2004).
>
> Barring that, I'd just change the
>
> s2 = DEFAULT_RANGE_INEQ_SEL;
>
> To
>
> s2 = DEFAULT_RANGE_INEQ_SEL * notnullsel;
>
> Which makes a lot more sense.
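
For what it's worth, the shape of that minimal change might look like the
sketch below. It reuses the nulltestsel() call from the snippet above;
whether that call (and rqlist->var) is actually usable at the spot where
s2 = DEFAULT_RANGE_INEQ_SEL is assigned is an assumption on my part, not
something I've verified against the patch.

if (rqlist->hibound == DEFAULT_INEQ_SEL ||
    rqlist->lobound == DEFAULT_INEQ_SEL)
{
    /*
     * No stats for at least one bound, so fall back to the default, but
     * rows that are NULL can never pass the range qual, so discount them.
     */
    Selectivity nullsel = nulltestsel(root, IS_NULL, rqlist->var, varRelid);

    s2 = DEFAULT_RANGE_INEQ_SEL * (1.0 - nullsel);
}
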

I think to fix this properly, the selfuncs would have to return the
estimate and the nullfrac separately, so that the nullfrac could be
applied just once per expression. There are already hacks in there to
get rid of the double NULL filtering. I'm not proposing that for this
patch, though; it would be pretty invasive to change.
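
Very roughly, and purely as a sketch of the idea rather than a worked-out
API (the struct and function below are invented names, not anything in
the tree), the selfuncs would hand back something like:

typedef double Selectivity;     /* as in the backend; repeated here for the sketch */

/* Hypothetical shape only. */
typedef struct SelParts
{
    Selectivity sel;        /* selectivity among non-NULL rows */
    double      nullfrac;   /* fraction of rows that are NULL for the Var */
} SelParts;

/*
 * Combine the two halves of a range qual on the same Var: add the
 * per-bound estimates, clamp, and apply the NULL discount exactly once
 * instead of once per sub-clause.
 */
static Selectivity
range_sel(SelParts lobound, SelParts hibound)
{
    Selectivity s = lobound.sel + hibound.sel - 1.0;

    if (s < 0.0)
        s = 0.0;            /* contradictory clauses, or rounding */

    return s * (1.0 - lobound.nullfrac);
}
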

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
