Re: Use extended statistics to estimate (Var op Var) clauses

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Use extended statistics to estimate (Var op Var) clauses
Date: 2021-08-30 15:59:50
Message-ID: e16416cc-51a0-dda1-64d5-d07ae34f0ea4@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 8/28/21 6:30 PM, Mark Dilger wrote:
>
>
>> On Aug 28, 2021, at 6:52 AM, Tomas Vondra
>> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>
>> Part 0003 fixes handling of those clauses so that we don't treat
>> them as simple, but it does that by tweaking
>> statext_is_compatible_clause(), as suggested by Dean.
>
> Function examine_opclause_args() doesn't set issimple to anything in
> the IsA(rightop, Const) case, but assigns *issimplep = issimple at
> the bottom. The compiler is not complaining about using a possibly
> uninitialized variable, but if I change the "return true" on the very
> next line to "return issimple", the compiler complains quite loudly.
>

Yeah, true. Thanks for noticing this was a bug - I forgot to set the
issimple variable in the first branch.

>
> Some functions define bool *issimple, others bool *issimplep and bool
> issimple. You might want to standardize the naming.
>

I think the naming is standard with respect to the surrounding code. If
the other parameters use "p" to mark "pointer" then issimplep is used,
but in other places it's just "issimple". IMHO this is appropriate.

> It's difficult to know what "simple" means in extended_stats.c.
> There is no file-global comment explaining the concept, and functions
> like compare_scalars_simple don't have correlates named
> compare_scalars_complex or such, so the reader cannot infer by
> comparison what the difference might be between a "simple" case and
> some non-"simple" case. The functions' issimple (or issimplep)
> argument are undocumented.
>
> There is a comment:
>
> /* * statext_mcv_clauselist_selectivity * Estimate clauses using
> the best multi-column statistics. .... * * - simple selectivity:
> Computed without extended statistics, i.e. as if the *
> columns/clauses were independent. * .... */
>
> but it takes a while to find if you search for "issimple".
>

Yeah, true. This was added a while ago when Dean reworked the estimation
(based on MCV), and it seemed clear back then. But now a comment
explaining this concept (and how it affects the estimation) would be
helpful. I'll try digging in the archives for the details.

>
> In both scalarineqsel_wrapper() and eqsel_internal(), the call to
> matching_restriction_variables() should usually return false, since
> comparing a variable to itself is an unusual case. The next call is
> to get_restriction_variable(), which repeats the work of examining
> the left and right variables. So in almost all cases, after throwing
> away the results of:
>
> examine_variable(root, left, varRelid, &ldata);
> examine_variable(root, right, varRelid, &rdata);
>
> performed in matching_restriction_variables(), we'll do exactly the
> same work again (with one variable named differently) in
> get_restriction_variable():
>
> examine_variable(root, left, varRelid, vardata);
> examine_variable(root, right, varRelid, &rdata);
>
> That'd be fine if example_variable() were a cheap function, but it
> appears not to be. Do you think you could save the results rather
> than recomputing them? It's a little messy, since these are the only
> two functions out of about ten which follow this pattern, so you'd
> have to pass NULLs into get_restriction_variable() from the other
> eight callers, but it still looks like that would be a win.
>

I had similar concerns, although I don't think those functions are very
expensive compared to the rest of the estimation code. I haven't done
any measurements yet, though.

But I don't think saving the results is the way to go - in a way, we
already store the stats (which seems like the most expensive bit) in
syscache. It seems better to just simplify examine_variable() so that it
does not lookup the statistics, which we don't need here at all.

The attached version of the patches fixes the other bugs reported here
so far - most importantly it reworks how we set issimple while examining
the clauses, so that it's never skips the initialization. Hopefully the
added comments also explain it a bit more clearly.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
0001-Improve-estimates-for-Var-op-Var-with-the-s-20210830.patch text/x-patch 4.3 KB
0002-main-patch-20210830.patch text/x-patch 25.6 KB
0003-Don-t-treat-Var-op-Var-as-simple-clauses-20210830.patch text/x-patch 12.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Aleksander Alekseev 2021-08-30 16:09:14 Re: Patch: shouldn't timezone(text, timestamp[tz]) be STABLE?
Previous Message Justin Pryzby 2021-08-30 15:42:49 Re: pg_dump, ATTACH, and independently restorable child partitions