Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Mark Dilger <hornschnorter(at)gmail(dot)com>, Adrien Nayrat <adrien(dot)nayrat(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2019-01-10 17:09:42
Message-ID: CAEZATCV5Xq7FGFYvigJWPZQrV9Hx37WvHfTgKdf6Ak3smbip2w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, 26 Dec 2018 at 22:09, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> Attached is an updated version of the patch - rebased and fixing the
> warnings reported by Thomas Munro.
>

Here are a few random review comments based on what I've read so far:

On the CREATE STATISTICS doc page, the syntax in the new examples
added to the bottom of the page is incorrect. E.g., instead of

CREATE STATISTICS s2 WITH (mcv) ON (a, b) FROM t2;

it should read

CREATE STATISTICS s2 (mcv) ON a, b FROM t2;

I think perhaps there should be also be a short explanatory sentence
after each example (as in the previous one) just to explain what the
example is intended to demonstrate. E.g., for the new MCV example,
perhaps say

These statistics give the planner more detailed information about the
specific values that commonly appear in the table, as well as an upper
bound on the selectivities of combinations of values that do not appear in
the table, allowing it to generate better estimates in both cases.

I don't think there's a need for too much detail there, since it's
explained more fully elsewhere, but it feels like it needs a little
more just to explain the purpose of the example.

There is additional documentation in perform.sgml that needs updating
-- about what kinds of stats the planner keeps. Those docs are
actually quite similar to the ones on planstats.sgml. It seems the
former focus more one what stats the planner stores, while the latter
focus on how the planner uses those stats.

In func.sgml, the docs for pg_mcv_list_items need extending to include
the base frequency column. Similarly for the example query in
planstats.sgml.

Tab-completion for the CREATE STATISTICS statement should be extended
for the new kinds.

Looking at mcv_update_match_bitmap(), it's called 3 times (twice
recursively from within itself), and I think the pattern for calling
it is a bit messy. E.g.,

/* by default none of the MCV items matches the clauses */
bool_matches = palloc0(sizeof(char) * mcvlist->nitems);

if (or_clause(clause))
{
/* OR clauses assume nothing matches, initially */
memset(bool_matches, STATS_MATCH_NONE, sizeof(char) *
mcvlist->nitems);
}
else
{
/* AND clauses assume everything matches, initially */
memset(bool_matches, STATS_MATCH_FULL, sizeof(char) *
mcvlist->nitems);
}

/* build the match bitmap for the OR-clauses */
mcv_update_match_bitmap(root, bool_clauses, keys,
mcvlist, bool_matches,
or_clause(clause));

the comment for the AND case directly contradicts the initial comment,
and the final comment is wrong because it could be and AND clause. For
a NOT clause it does:

/* by default none of the MCV items matches the clauses */
not_matches = palloc0(sizeof(char) * mcvlist->nitems);

/* NOT clauses assume nothing matches, initially */
memset(not_matches, STATS_MATCH_FULL, sizeof(char) *
mcvlist->nitems);

/* build the match bitmap for the NOT-clause */
mcv_update_match_bitmap(root, not_args, keys,
mcvlist, not_matches, false);

so the second comment is wrong. I understand the evolution that lead
to this function existing in this form, but I think that it can now be
refactored into a "getter" function rather than an "update" function.
I.e., something like mcv_get_match_bitmap() which first allocates the
array to be returned and initialises it based on the passed-in value
of is_or. That way, all the calling sites can be simplified to
one-liners like

/* get the match bitmap for the AND/OR clause */
bool_matches = mcv_get_match_bitmap(root, bool_clauses, keys,
mcvlist, or_clause(clause));

In the previous discussion around UpdateStatisticsForTypeChange(), the
consensus appeared to be that we should just unconditionally drop all
extended statistics when ALTER TABLE changes the type of an included
column (just as we do for per-column stats), since such a type change
can rewrite the data in arbitrary ways, so there's no reason to assume
that the old stats are still valid. I think it makes sense to extract
that as a separate patch to be committed ahead of these ones, and I'd
also argue for back-patching it.

That's it for now. I'll try to keep reviewing if time permits.

Regards,
Dean

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2019-01-10 17:18:35 Re: Policy on cross-posting to multiple lists
Previous Message Alvaro Herrera 2019-01-10 16:43:25 Re: [HACKERS] pgbench - allow to store select results into variables