Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Mark Dilger <hornschnorter(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2019-01-17 21:03:47
Message-ID: 3edd6dcb-96c2-e45d-f853-8313c7e5b5ac@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

thanks for the review. The attached patches address most of the issues
mentioned in the past several messages, both in the MCV and histogram parts.

A couple of items remains:

> 15. I see you broke out the remainder of the code from
> clauselist_selectivity() into clauselist_selectivity_simple(). The
> comment looks like just a copy and paste from the original. That
> seems like quite a bit of duplication. Is it better to maybe trim down
> the original one?

I don't follow - where do you see the code duplication? Essentially, we
have clauselist_selectivity and clauselist_selectivity_simple, but the
first one calls the second one. The "simple" version is needed because
in some cases we need to perform estimation without multivariate stats
(e.g. to prevent infinite loop due to recursion).

> 18. In dependencies_clauselist_selectivity() there seem to be a new
> bug introduced. We do:
>
> /* mark this one as done, so we don't touch it again. */
> *estimatedclauses = bms_add_member(*estimatedclauses, listidx);
>
> but the bms_is_member() check that skipped these has been removed.
>
> It might be easier to document if we just always do:
>
> if (bms_is_member(listidx, *estimatedclauses))
> continue;
>
> at the start of both loops. list_attnums can just be left unset for
> the originally already estimatedclauses.

This was already discussed - I don't think there's any bug, but I'll
look into refactoring the code somehow to make it clear.

> 21. Comment does not really explain what the function does or what the
> arguments mean:
>
> /*
> * statext_is_compatible_clause_internal
> * Does the heavy lifting of actually inspecting the clauses for
> * statext_is_compatible_clause.
> */

Isn't it explained in the statext_is_compatible_clause comment?

> 25. Does statext_is_compatible_clause_internal)_ need to skip over
> RelabelTypes?

I don't think it should, because while RelabelType nodes represent casts
to binary-compatible types, there's no guarantee the semantics actually
is compatible. So for example if you do this:

create table t (a int, b int);
insert into t select mod(i,100), mod(i,100)
from generate_series(1,1000000) s(i);
create statistics s (mcv) on a, b from t;
analyze t;
explain analyze select * from t where a = 1::oid and b = 1::oid;

then there will be a RelabelType nodes casting each column from int4 to
oid. So the estimation will be made following oid semantics. But the MCV
list contains int4 values, and is built using int4-specific operators.

I admit this int4/oid example is fairly trivial, but it's not clear to
me we can assume all RelabelType will behave like that. The types may be
binary-coerible, but may use vastly different operators - think about
citext vs. text, for example.

> 35. The evaluation order of this macro is wrong.
>
> #define ITEM_SIZE(ndims) \
> (ndims * (sizeof(uint16) + sizeof(bool)) + 2 * sizeof(double))
>

Nope, as mentioned by Dean, it's actually correct.

> 36. Could do with some comments in get_mincount_for_mcv_list(). What's
> magic about 0.04?

That was copied from another patch, but I've removed the comment
explaining the details - I've now added it back, which I think should be
more than enough.

> 40. The comment in the above item seems to indicate the condition for
> when all items can fit in the number of groups, but the if condition
> does not seem to allow for an exact match?
>
> if (ngroups > nitems)
>
> if you want to check if the number of items can fit in the number of
> groups should it be: if (ngroups >= nitems) or if (nitems <= ngroups)
> ? Perhaps I've misunderstood. The comment is a little confusing as I'm
> not sure where the "Otherwise" code is located.

No, the whole point of that block is to decide how many groups to keep
if there are more groups than we have space for (based on stats target).
So if (ngroups == nitems) or (ngrouos < nitems) then we can keep all of
them.

> 41. I don't think palloc0() is required here. palloc() should be fine
> since you're initialising each element in the loop.
>
> ...
>
> I think I agree with the comment above that chunk about reducing the
> number of pallocs, even if it's just allocating the initial array as
> MCVItems instead of pointers to MCVItems

I've left this as it is for now. The number of extra pallocs() is fairly
low anyway, so I don't think it's worth the extra complexity.

> 47. Wondering about the logic behind the variation between elog() and
> ereport() in statext_mcv_deserialize(). They all looks like "can't
> happen" type errors.

That's mostly random, I'll review and fix that. All "can't happen" cases
should use elog().

> 3. I've not really got into understanding how the new statistics types
> are applied yet, but I found this:
>
> * If asked to build both MCV and histogram, first build the MCV part
> * and then histogram on the remaining rows.
>
> I guess that means we'll get different estimates with:
>
> create statistic a_stats (mcv,histogram) on a,b from t;
>
> vs
>
> create statistic a_stats1 (mcv) on a,b from t;
> create statistic a_stats2 (histogram) on a,b from t;
>
> Is that going to be surprising to people?

Well, I don't have a good answer to this, except for mentioning this in
the SGML docs.

> 5. serialize_histogram() and statext_histogram_deserialize(), should
> these follow the same function naming format?

Perhaps, although serialize_histogram() is static and so it's kinda
internal API.

> 8. Looking at statext_clauselist_selectivity() I see it calls
> choose_best_statistics() passing requiredkinds as STATS_EXT_INFO_MCV |
> STATS_EXT_INFO_HISTOGRAM, do you think the function now needs to
> attempt to find the best match plus the one with the most statistics
> kinds?
>
> It might only matter if someone had:
>
> create statistic a_stats1 (mcv) on a,b from t;
> create statistic a_stats2 (histogram) on a,b from t;
> create statistic a_stats3 (mcv,histogram) on a,b from t;
>
> Is it fine to just return a_stats1 and ignore the fact that a_stats3
> is probably better? Or too corner case to care?

I don't know. My assumption is people will not create such overlapping
statics.

> 9. examine_equality_clause() assumes it'll get a Var. I see we should
> only allow clauses that pass statext_is_compatible_clause_internal(),
> so maybe it's worth an Assert(IsA(var, Var)) along with a comment to
> mention anything else could not have been allowed.

Maybe.

> 10. Does examine_equality_clause need 'root' as an argument?

Probably not. I guess it's a residue some older version.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
0001-multivariate-MCV-lists-20190117.patch.gz application/gzip 42.0 KB
0002-multivariate-histograms-20190117.patch.gz application/gzip 47.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Mikael Kjellström 2019-01-17 21:12:57 Re: PSA: we lack TAP test coverage on NetBSD and OpenBSD
Previous Message Tom Lane 2019-01-17 20:42:00 Re: Fixing findDependentObjects()'s dependency on scan order (regressions in DROP diagnostic messages)