Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Mark Dilger <hornschnorter(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2019-03-16 21:26:40
Message-ID: CAEZATCVazGRDjbZpRF6r-Asiv_-U8vcT-VA0oSZribhhmDUQHQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, 15 Mar 2019 at 00:06, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> ... attached patch ...

Some more review comments, carrying on from where I left off:

16). This regression test fails for me:

@@ -654,11 +654,11 @@
-- check change of unrelated column type does not reset the MCV statistics
ALTER TABLE mcv_lists ALTER COLUMN d TYPE VARCHAR(64);
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a =
1 AND b = ''1''');
estimated | actual
-----------+--------
- 50 | 50
+ 11 | 50
(1 row)

Maybe that's platform-dependent, given what you said about
reltuples/relpages being reset. An easy workaround for this would be
to modify this test (and perhaps the one that follows) to just query
pg_statistic_ext to see if the MCV statistics have been reset.

17). I'm definitely preferring the new style of tests because they're
much neater and easier to read, and to directly see the effect of the
extended statistics. One thing I'd consider adding is a query of
pg_statistic_ext using pg_mcv_list_items() after creating the MCV
stats, both to test that function, and to show that the MCV lists have
the expected contents (provided that output isn't too large).

18). Spurious whitespace added to src/backend/statistics/mvdistinct.c.

19). In the function comment for statext_mcv_clauselist_selectivity(),
the name at the top doesn't match the new function name. Also, I think
it should mention MCV in the initial description. I.e., instead of

+/*
+ * mcv_clauselist_selectivity
+ * Estimate clauses using the best multi-column statistics.

it should say:

+/*
+ * statext_mcv_clauselist_selectivity
+ * Estimate clauses using the best multi-column MCV statistics.

20). Later in the same comment, this part should now be deleted:

+ *
+ * So (simple_selectivity - base_selectivity) may be seen as a correction for
+ * the part not covered by the MCV list.

21). For consistency with other bms_ functions, I think the name of
the Bitmapset argument for bms_member_index() should just be called
"a". Nitpicking, I'd also put bms_member_index() immediately after
bms_is_member() in the source, to match the header.

22). mcv_get_match_bitmap() should really use an array of bool rather
than an array of char. Note that a bool is guaranteed to be of size 1,
so it won't make things any less efficient, but it will allow some
code to be made neater. E.g., all clauses like "matches[i] == false"
and "matches[i] != false" can just be made "!matches[i]" or
"matches[i]". Also the Min/Max expressions on those match flags can be
replaced with the logical operators && and ||.

23). Looking at this code in statext_mcv_build():

/* store info about data type OIDs */
i = 0;
j = -1;
while ((j = bms_next_member(attrs, j)) >= 0)
{
VacAttrStats *colstat = stats[i];

mcvlist->types[i] = colstat->attrtypid;
i++;
}

it isn't actually making use of the attribute numbers (j) from attrs,
so this could be simplified to:

/* store info about data type OIDs */
for (i = 0; i < numattrs; i++)
mcvlist->types[i] = stats[i]->attrtypid;

24). Later in that function, the following comment doesn't appear to
make sense. Is this possibly from an earlier version of the code?

/* copy values from the _previous_ group (last item of) */

25). As for (23), in build_mss(), the loop over the Bitmapset of
attributes never actually uses the attribute numbers (j), so that
could just be a loop from i=0 to numattrs-1, and then that function
doesn't need to be passed the Bitmapset at all -- it could just be
passed the integer numattrs.

26). build_distinct_groups() looks like it makes an implicit
assumption that the counts of the items passed in are all zero. That
is indeed the case, if they've come from build_sorted_items(), because
that does a palloc0(), but that feels a little fragile. I think it
would be better if build_distinct_groups() explicitly set the count
each time it detects a new group.

27). In statext_mcv_serialize(), the TODO comment says

* TODO: Consider packing boolean flags (NULL) for each item into a single char
* (or a longer type) instead of using an array of bool items.

A more efficient way to save space might be to do away with the
boolean null flags entirely, and just use a special index value like
0xffff to signify a NULL value.

28). I just spotted the 1MB limit on the serialised MCV list size. I
think this is going to be too limiting. For example, if the stats
target is at its maximum of 10000, that only leaves around 100 bytes
for each item's values, which is easily exceeded. In fact, I think
this approach for limiting the MCV list size isn't a good one --
consider what would happen if there were lots of very large values.
Would it run out of memory before getting to that test? Even if not,
it would likely take an excessive amount of time.

I think this part of the patch needs a bit of a rethink. My first
thought is to do something similar to what happens for per-column
MCVs, and set an upper limit on the size of each value that is ever
considered for inclusion in the stats (c.f. WIDTH_THRESHOLD and
toowide_cnt in analyse.c). Over-wide values should be excluded early
on, and it will need to track whether or not any such values were
excluded, because then it wouldn't be appropriate to treat the stats
as complete and keep the entire list, without calling
get_mincount_for_mcv_list().

That's it for now.

Regards,
Dean

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Shawn Debnath 2019-03-16 22:27:17 Re: Introduce timeout capability for ConditionVariableSleep
Previous Message Tom Lane 2019-03-16 21:21:12 Re: Fix XML handling with DOCTYPE