Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Mark Dilger <hornschnorter(at)gmail(dot)com>, Adrien Nayrat <adrien(dot)nayrat(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2019-01-13 00:31:30
Message-ID: 19b77ce3-83ec-c04b-ff7a-02475f044f4d@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 1/10/19 6:09 PM, Dean Rasheed wrote:
> On Wed, 26 Dec 2018 at 22:09, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>> Attached is an updated version of the patch - rebased and fixing the
>> warnings reported by Thomas Munro.
>>
>
> Here are a few random review comments based on what I've read so far:
>
>
> On the CREATE STATISTICS doc page, the syntax in the new examples
> added to the bottom of the page is incorrect. E.g., instead of
>
> CREATE STATISTICS s2 WITH (mcv) ON (a, b) FROM t2;
>
> it should read
>
> CREATE STATISTICS s2 (mcv) ON a, b FROM t2;
>

Fixed.

> I think perhaps there should be also be a short explanatory sentence
> after each example (as in the previous one) just to explain what the
> example is intended to demonstrate. E.g., for the new MCV example,
> perhaps say
>
> These statistics give the planner more detailed information about the
> specific values that commonly appear in the table, as well as an upper
> bound on the selectivities of combinations of values that do not appear in
> the table, allowing it to generate better estimates in both cases.
>
> I don't think there's a need for too much detail there, since it's
> explained more fully elsewhere, but it feels like it needs a little
> more just to explain the purpose of the example.
>

I agree, this part of docs can be quite terse. I've adopted the wording
you proposed, and I've done something similar for the histogram patch,
which needs to add something too. It's a bit repetitive, though.

>
> There is additional documentation in perform.sgml that needs updating
> -- about what kinds of stats the planner keeps. Those docs are
> actually quite similar to the ones on planstats.sgml. It seems the
> former focus more one what stats the planner stores, while the latter
> focus on how the planner uses those stats.
>

OK, I've expanded this part a bit too.

>
> In func.sgml, the docs for pg_mcv_list_items need extending to include
> the base frequency column. Similarly for the example query in
> planstats.sgml.
>

Fixed.

>
> Tab-completion for the CREATE STATISTICS statement should be extended
> for the new kinds.
>

Fixed.

>
> Looking at mcv_update_match_bitmap(), it's called 3 times (twice
> recursively from within itself), and I think the pattern for calling
> it is a bit messy. E.g.,
>
> /* by default none of the MCV items matches the clauses */
> bool_matches = palloc0(sizeof(char) * mcvlist->nitems);
>
> if (or_clause(clause))
> {
> /* OR clauses assume nothing matches, initially */
> memset(bool_matches, STATS_MATCH_NONE, sizeof(char) *
> mcvlist->nitems);
> }
> else
> {
> /* AND clauses assume everything matches, initially */
> memset(bool_matches, STATS_MATCH_FULL, sizeof(char) *
> mcvlist->nitems);
> }
>
> /* build the match bitmap for the OR-clauses */
> mcv_update_match_bitmap(root, bool_clauses, keys,
> mcvlist, bool_matches,
> or_clause(clause));
>
> the comment for the AND case directly contradicts the initial comment,
> and the final comment is wrong because it could be and AND clause. For
> a NOT clause it does:
>
> /* by default none of the MCV items matches the clauses */
> not_matches = palloc0(sizeof(char) * mcvlist->nitems);
>
> /* NOT clauses assume nothing matches, initially */
> memset(not_matches, STATS_MATCH_FULL, sizeof(char) *
> mcvlist->nitems);
>
> /* build the match bitmap for the NOT-clause */
> mcv_update_match_bitmap(root, not_args, keys,
> mcvlist, not_matches, false);
>
> so the second comment is wrong. I understand the evolution that lead
> to this function existing in this form, but I think that it can now be
> refactored into a "getter" function rather than an "update" function.
> I.e., something like mcv_get_match_bitmap() which first allocates the
> array to be returned and initialises it based on the passed-in value
> of is_or. That way, all the calling sites can be simplified to
> one-liners like
>
> /* get the match bitmap for the AND/OR clause */
> bool_matches = mcv_get_match_bitmap(root, bool_clauses, keys,
> mcvlist, or_clause(clause));
>

Yes, I agree. I've reworked the function per your proposal, and I've
done the same for the histogram too.

>
> In the previous discussion around UpdateStatisticsForTypeChange(), the
> consensus appeared to be that we should just unconditionally drop all
> extended statistics when ALTER TABLE changes the type of an included
> column (just as we do for per-column stats), since such a type change
> can rewrite the data in arbitrary ways, so there's no reason to assume
> that the old stats are still valid. I think it makes sense to extract
> that as a separate patch to be committed ahead of these ones, and I'd
> also argue for back-patching it.
>

Wasn't the agreement to keep stats that don't include column values
(functional dependencies and ndistinct coefficients), and reset only
more complex stats? That's what happens in master and how it's extended
by the patch for MCV lists and histograms.

>
> That's it for now. I'll try to keep reviewing if time permits.
>

Thanks!

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
0001-multivariate-MCV-lists.patch.gz application/gzip 40.6 KB
0002-multivariate-histograms.patch.gz application/gzip 47.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Paul Martinez 2019-01-13 00:55:12 PATCH: Include all columns in default names for foreign key constraints.
Previous Message Tomas Vondra 2019-01-13 00:04:00 Re: [HACKERS] PATCH: multivariate histograms and MCV lists