Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Mark Dilger <hornschnorter(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2019-01-22 14:43:35
Message-ID: CAKJS1f-gQVQ6Je+vA6GjK_NKcNAL8haa47=gWet6vgkNMXQ+hQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, 18 Jan 2019 at 10:03, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> thanks for the review. The attached patches address most of the issues
> mentioned in the past several messages, both in the MCV and histogram parts.

I made another pass over the 0001 patch. I've not read through mcv.c
again yet. Will try to get to that soon.

0001-multivariate-MCV-lists-20190117.patch

1. The following mentions "multiple functions", but lists just 1 function.

<para>
To inspect statistics defined using <command>CREATE STATISTICS</command>
command, <productname>PostgreSQL</productname> provides multiple functions.
</para>

2. There's a mix of usages of <literal>MCV</literal> and
<acronym>MCV</acronym> around the docs. Should these be the same?

3. analyze_mcv_list() is modified to make it an external function, but
it's not used anywhere out of analyze.c

4. The following can be simplified further:

* We can also leave the record as it is if there are no statistics
* including the datum values, like for example MCV lists.
*/
if (statext_is_kind_built(oldtup, STATS_EXT_MCV))
reset_stats = true;

/*
* If we can leave the statistics as it is, just do minimal cleanup
* and we're done.
*/
if (!reset_stats)
{
ReleaseSysCache(oldtup);
return;
}

to just:

/*
* When none of the defined statistics types contain datum values
* from the table's columns then there's no need to reset the stats.
* Functional dependencies and ndistinct stats should still hold true.
*/
if (!statext_is_kind_built(oldtup, STATS_EXT_MCV))
{
ReleaseSysCache(oldtup);
return;
}

5. "so that we can ignore them below." seems misplaced now since
you've moved all the code below into clauselist_selectivity_simple().
Maybe you can change it to "so that we can inform
clauselist_selectivity_simple about clauses that it should ignore" ?

* filled with the 0-based list positions of clauses used that way, so
* that we can ignore them below.

6. README.mcv: multi-variate -> multivariate

are large the list may be quite large. This is especially true for multi-variate

7. README.mcv: similar -> a similar

it impossible to use anyarrays. It might be possible to produce similar

8. I saw you added IS NOT NULL to README.mcv, but README just mentions:

(b) MCV lists - equality and inequality clauses (AND, OR, NOT), IS NULL

Should that mention IS NOT NULL too?

9. The header comment for build_attnums_array() claims that it
"transforms an array of AttrNumber values into a bitmap", but it does
the opposite.

* Transforms an array of AttrNumber values into a bitmap.

10. The following Assert is not entirely useless. The bitmapset could
have a 0 member, but it can't store negative values.

while ((j = bms_next_member(attrs, j)) >= 0)
{
/* user-defined attributes only */
Assert(AttrNumberIsForUserDefinedAttr(j));

Just checking you thought of that when you added it?

11. XXX comments are normally reserved for things we may wish to
reconsider later, but the following seems more like a "Note:"

* XXX All the memory is allocated in a single chunk, so that the caller
* can simply pfree the return value to release all of it.

12. In statext_is_compatible_clause_internal() there's still a comment
that mentions "Var op Const", but Const op Var is also okay too.

13. This is not fall-through. Generally, such a comment is reserved
to confirm that the "break;" is meant to be missing.

default:
/* fall-through */
return false;

https://developers.redhat.com/blog/2017/03/10/wimplicit-fallthrough-in-gcc-7/
mentions various comment patterns that are used for that case. Your
case seems misplaced since it's right about a return, and not another
case.

14. The header comment for statext_is_compatible_clause() is not
accurate. It mentions only opexprs with equality operations are
allowed, but none of those are true.

* Only OpExprs with two arguments using an equality operator are supported.
* When returning True attnum is set to the attribute number of the Var within
* the supported clause.

15. statext_clauselist_selectivity(): "a number" -> "the number" ?

* Selects the best extended (multi-column) statistic on a table (measured by
* a number of attributes extracted from the clauses and covered by it), and

16. I understand you're changing this to a bitmask in the 0002 patch,
but int is the wrong type here;
/* we're interested in MCV lists */
int types = STATS_EXT_MCV;

Maybe just pass the STATS_EXT_MCV directly, or at least make it a char.

17. bms_membership(clauses_attnums) != BMS_MULTIPLE seems better here.
It can stop once it finds 2. No need to count them all.

/* We need at least two attributes for MCV lists. */
if (bms_num_members(clauses_attnums) < 2)
return 1.0;

18. The following comment in statext_is_compatible_clause_internal()
does not seem to be true. I see OpExprs are supported and NULL test,
including others too.

/* We only support plain Vars for now */

19. The header comment for clauselist_selectivity_simple() does not
mention what estimatedclauses is for.

20. New line. Also, missing "the" before "maximum"

+ * We
+ * iteratively search for multivariate n-distinct with maximum number

21. This comment seems like it's been copied from
estimate_num_groups() without being edited.

/* we're done with this relation */
varinfos = NIL;

Looks like it's using this to break out of the loop.

22. I don't see any dividing going on below this comment:

/*
* Sanity check --- don't divide by zero if empty relation.
*/

23. I see a few tests mentioning: "-- check change of unrelated column
type does not reset the MCV statistics"

Would it be better to just look at pg_statistic_ext there and do something like:

SELECT COUNT(*) FROM pg_statistic_ext WHERE stxname = 'whatever' AND
stxmcv IS NOT NULL;

Otherwise, you seem to be ensuring the stats were not reset by looking
at a query plan, so it's a bit harder to follow and likely testing
more than it needs to.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kevin Grittner 2019-01-22 14:46:59 Re: TestForOldSnapshot() seems to be in the wrong place
Previous Message Kevin Grittner 2019-01-22 14:33:48 Re: Refactoring the checkpointer's fsync request queue