Re: Use extended statistics to estimate (Var op Var) clauses

From: Zhihong Yu <zyu(at)yugabyte(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Use extended statistics to estimate (Var op Var) clauses
Date: 2021-08-30 22:14:28
Message-ID: CALNJ-vRvZ4PGzCgpOWrahjRrJeP-WnKO=a7FoMJYE_AdvKC7eQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Aug 30, 2021 at 9:00 AM Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
wrote:

> On 8/28/21 6:30 PM, Mark Dilger wrote:
> >
> >
> >> On Aug 28, 2021, at 6:52 AM, Tomas Vondra
> >> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
> >>
> >> Part 0003 fixes handling of those clauses so that we don't treat
> >> them as simple, but it does that by tweaking
> >> statext_is_compatible_clause(), as suggested by Dean.
> >
> > Function examine_opclause_args() doesn't set issimple to anything in
> > the IsA(rightop, Const) case, but assigns *issimplep = issimple at
> > the bottom. The compiler is not complaining about using a possibly
> > uninitialized variable, but if I change the "return true" on the very
> > next line to "return issimple", the compiler complains quite loudly.
> >
>
> Yeah, true. Thanks for noticing this was a bug - I forgot to set the
> issimple variable in the first branch.
>
> >
> > Some functions define bool *issimple, others bool *issimplep and bool
> > issimple. You might want to standardize the naming.
> >
>
> I think the naming is standard with respect to the surrounding code. If
> the other parameters use "p" to mark "pointer" then issimplep is used,
> but in other places it's just "issimple". IMHO this is appropriate.
>
> > It's difficult to know what "simple" means in extended_stats.c.
> > There is no file-global comment explaining the concept, and functions
> > like compare_scalars_simple don't have correlates named
> > compare_scalars_complex or such, so the reader cannot infer by
> > comparison what the difference might be between a "simple" case and
> > some non-"simple" case. The functions' issimple (or issimplep)
> > argument are undocumented.
> >
> > There is a comment:
> >
> > /* * statext_mcv_clauselist_selectivity * Estimate clauses using
> > the best multi-column statistics. .... * * - simple selectivity:
> > Computed without extended statistics, i.e. as if the *
> > columns/clauses were independent. * .... */
> >
> > but it takes a while to find if you search for "issimple".
> >
>
> Yeah, true. This was added a while ago when Dean reworked the estimation
> (based on MCV), and it seemed clear back then. But now a comment
> explaining this concept (and how it affects the estimation) would be
> helpful. I'll try digging in the archives for the details.
>
> >
> > In both scalarineqsel_wrapper() and eqsel_internal(), the call to
> > matching_restriction_variables() should usually return false, since
> > comparing a variable to itself is an unusual case. The next call is
> > to get_restriction_variable(), which repeats the work of examining
> > the left and right variables. So in almost all cases, after throwing
> > away the results of:
> >
> > examine_variable(root, left, varRelid, &ldata);
> > examine_variable(root, right, varRelid, &rdata);
> >
> > performed in matching_restriction_variables(), we'll do exactly the
> > same work again (with one variable named differently) in
> > get_restriction_variable():
> >
> > examine_variable(root, left, varRelid, vardata);
> > examine_variable(root, right, varRelid, &rdata);
> >
> > That'd be fine if example_variable() were a cheap function, but it
> > appears not to be. Do you think you could save the results rather
> > than recomputing them? It's a little messy, since these are the only
> > two functions out of about ten which follow this pattern, so you'd
> > have to pass NULLs into get_restriction_variable() from the other
> > eight callers, but it still looks like that would be a win.
> >
>
> I had similar concerns, although I don't think those functions are very
> expensive compared to the rest of the estimation code. I haven't done
> any measurements yet, though.
>
> But I don't think saving the results is the way to go - in a way, we
> already store the stats (which seems like the most expensive bit) in
> syscache. It seems better to just simplify examine_variable() so that it
> does not lookup the statistics, which we don't need here at all.
>
>
> The attached version of the patches fixes the other bugs reported here
> so far - most importantly it reworks how we set issimple while examining
> the clauses, so that it's never skips the initialization. Hopefully the
> added comments also explain it a bit more clearly.
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
Hi,
For patch 0002,

+ s1 = statext_clauselist_selectivity(root, clauses,
varRelid,
+ jointype, sjinfo,
rel,
+ &estimatedclauses,
false);
+
+ estimated = (bms_num_members(estimatedclauses) == 1);

I took a look at clauselist_apply_dependencies() (called by
statext_clauselist_selectivity) where estimatedclauses is modified.
Since the caller would not use the returned Selectivity if number of
elements in estimatedclauses is greater than 1, I wonder
if a parameter can be added to clauselist_apply_dependencies() which
indicates early return if the second element is added to estimatedclauses.

Cheers

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bossart, Nathan 2021-08-30 22:39:04 Re: archive status ".ready" files may be created too early
Previous Message Zhihong Yu 2021-08-30 21:11:34 Re: dynamic result sets support in extended query protocol