Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Mark Dilger <hornschnorter(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2019-03-16 23:44:23
Message-ID: 37ce47c5-e0cf-c6d3-6923-6b8cfd1cc6b1@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 3/16/19 10:26 PM, Dean Rasheed wrote:
> On Fri, 15 Mar 2019 at 00:06, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> ... attached patch ...
>
> Some more review comments, carrying on from where I left off:
>
> 16). This regression test fails for me:
>
> @@ -654,11 +654,11 @@
> -- check change of unrelated column type does not reset the MCV statistics
> ALTER TABLE mcv_lists ALTER COLUMN d TYPE VARCHAR(64);
> SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a =
> 1 AND b = ''1''');
> estimated | actual
> -----------+--------
> - 50 | 50
> + 11 | 50
> (1 row)
>
> Maybe that's platform-dependent, given what you said about
> reltuples/relpages being reset. An easy workaround for this would be
> to modify this test (and perhaps the one that follows) to just query
> pg_statistic_ext to see if the MCV statistics have been reset.
>

Ah, sorry for not explaining this bit - the failure is expected, due to
the reset of relpages/reltuples I mentioned. We do keep the extended
stats, but the relsize estimate changes a bit. It surprised me a bit,
and this test made the behavior apparent. The last patchset included a
piece that changes that - if we decide not to change this, I think we
can simply accept the actual output.

> 17). I'm definitely preferring the new style of tests because they're
> much neater and easier to read, and to directly see the effect of the
> extended statistics. One thing I'd consider adding is a query of
> pg_statistic_ext using pg_mcv_list_items() after creating the MCV
> stats, both to test that function, and to show that the MCV lists have
> the expected contents (provided that output isn't too large).
>

OK, will do.

> 18). Spurious whitespace added to src/backend/statistics/mvdistinct.c.
>

fixed

> 19). In the function comment for statext_mcv_clauselist_selectivity(),
> the name at the top doesn't match the new function name. Also, I think
> it should mention MCV in the initial description. I.e., instead of
>
> +/*
> + * mcv_clauselist_selectivity
> + * Estimate clauses using the best multi-column statistics.
>
> it should say:
>
> +/*
> + * statext_mcv_clauselist_selectivity
> + * Estimate clauses using the best multi-column MCV statistics.
>

fixed

> 20). Later in the same comment, this part should now be deleted:
>
> + *
> + * So (simple_selectivity - base_selectivity) may be seen as a correction for
> + * the part not covered by the MCV list.
>

fixed

> 21). For consistency with other bms_ functions, I think the name of
> the Bitmapset argument for bms_member_index() should just be called
> "a". Nitpicking, I'd also put bms_member_index() immediately after
> bms_is_member() in the source, to match the header.
>

I think I've already done the renames in the last patch I submitted (are
you looking at an older version of the code, perhaps?). I've moved it
right after bms_is_member - good idea.

> 22). mcv_get_match_bitmap() should really use an array of bool rather
> than an array of char. Note that a bool is guaranteed to be of size 1,
> so it won't make things any less efficient, but it will allow some
> code to be made neater. E.g., all clauses like "matches[i] == false"
> and "matches[i] != false" can just be made "!matches[i]" or
> "matches[i]". Also the Min/Max expressions on those match flags can be
> replaced with the logical operators && and ||.
>

fixed

> 23). Looking at this code in statext_mcv_build():
>
> /* store info about data type OIDs */
> i = 0;
> j = -1;
> while ((j = bms_next_member(attrs, j)) >= 0)
> {
> VacAttrStats *colstat = stats[i];
>
> mcvlist->types[i] = colstat->attrtypid;
> i++;
> }
>
> it isn't actually making use of the attribute numbers (j) from attrs,
> so this could be simplified to:
>
> /* store info about data type OIDs */
> for (i = 0; i < numattrs; i++)
> mcvlist->types[i] = stats[i]->attrtypid;
>

yep, fixed

> 24). Later in that function, the following comment doesn't appear to
> make sense. Is this possibly from an earlier version of the code?
>
> /* copy values from the _previous_ group (last item of) */
>

yep, seems like a residue from an older version, fixed

> 25). As for (23), in build_mss(), the loop over the Bitmapset of
> attributes never actually uses the attribute numbers (j), so that
> could just be a loop from i=0 to numattrs-1, and then that function
> doesn't need to be passed the Bitmapset at all -- it could just be
> passed the integer numattrs.
>

fixed

> 26). build_distinct_groups() looks like it makes an implicit
> assumption that the counts of the items passed in are all zero. That
> is indeed the case, if they've come from build_sorted_items(), because
> that does a palloc0(), but that feels a little fragile. I think it
> would be better if build_distinct_groups() explicitly set the count
> each time it detects a new group.
>

good idea, fixed

> 27). In statext_mcv_serialize(), the TODO comment says
>
> * TODO: Consider packing boolean flags (NULL) for each item into a single char
> * (or a longer type) instead of using an array of bool items.
>
> A more efficient way to save space might be to do away with the
> boolean null flags entirely, and just use a special index value like
> 0xffff to signify a NULL value.
>

Hmmm, maybe. I think there's a room for improvement.

> 28). I just spotted the 1MB limit on the serialised MCV list size. I
> think this is going to be too limiting. For example, if the stats
> target is at its maximum of 10000, that only leaves around 100 bytes
> for each item's values, which is easily exceeded. In fact, I think
> this approach for limiting the MCV list size isn't a good one --
> consider what would happen if there were lots of very large values.
> Would it run out of memory before getting to that test? Even if not,
> it would likely take an excessive amount of time.
>

True. I don't have a very good argument for a specific value, or even
having an explicit limit at all. I've initially added it mostly as a
safety for development purposes, but I think you're right we can just
get rid of it. I don't think it'd run out of memory before hitting the
limit, but I haven't tried very hard (but I recall running into the 1MB
limit in the past).

> I think this part of the patch needs a bit of a rethink. My first
> thought is to do something similar to what happens for per-column
> MCVs, and set an upper limit on the size of each value that is ever
> considered for inclusion in the stats (c.f. WIDTH_THRESHOLD and
> toowide_cnt in analyse.c). Over-wide values should be excluded early
> on, and it will need to track whether or not any such values were
> excluded, because then it wouldn't be appropriate to treat the stats
> as complete and keep the entire list, without calling
> get_mincount_for_mcv_list().
>
Which part? Serialization / deserialization? Or how we handle long
values when building the MCV list?

cheers

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
0001-retain-reltuples-relpages-on-ALTER-TABLE-20190317.patch.gz application/gzip 1.7 KB
0002-multivariate-MCV-lists-20190317.patch.gz application/gzip 39.4 KB
0003-multivariate-histograms-20190317.patch.gz application/gzip 48.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ramanarayana 2019-03-17 02:58:17 Re: Unaccent extension python script Issue in Windows
Previous Message Chris Cleveland 2019-03-16 23:24:11 Possible to modify query language in an extension?