Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Mark Dilger <hornschnorter(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2019-01-17 01:19:16
Message-ID: 72a97865-9864-9cd7-eedb-e96dae4bde65@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 1/16/19 7:56 AM, David Rowley wrote:> On Tue, 15 Jan 2019 at 08:21,
Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> Turns out you were right - the attribute_referenced piece was quite
>> unnecessary. So I've removed it. I've also extended the regression tests
>> to verify changing type of another column does not reset the stats.
>
> (Trying to find my feet over here)
>
> I've read over the entire thread, and apart from missing the last two
> emails and therefore the latest patch, I managed to read over most of
> the MCV patch. I didn't quite get to reading mcv.c and don't quite
> have the energy to take that on now.
>
Thanks for looking!

> At this stage I'm trying to get to know the patch. I read a lot of
> discussing between you and Dean ironing out how the stats should be
> used to form selectivities. At the time I'd not read the patch yet,
> so most of it went over my head.
>
> I did note down a few things on my read. I've included them below.
> Hopefully, they're useful.
>
> MCV list review
>
> 1. In mvc.c there's Assert(ndistinct <= UINT16_MAX); This should be
> PG_UINT16_MAX
>
Yep. Will fix.

> 2. math.h should be included just after postgres.h
>
Yep. Will fix.

> 3. Copyright is still -2017 in mcv.c. Hopefully, if you change it to
> 2019, you'll never have to bump it ever again! :-)
>
Optimist ;-)

> 4. Looking at pg_stats_ext_mcvlist_items() I see you've coded the
> string building manually. The way it's coded I'm finding a little
> strange. It means the copying becomes quadratic due to
>
> snprintf(buff, 1024, format, values[1], DatumGetPointer(valout));
> strncpy(values[1], buff, 1023);
>
> So basically, generally, here you're building a new string with
> values[1] followed by a comma, then followed by valout. One the next
> line you then copy that new buffer back into values[1]. I understand
> this part is likely not performance critical, but I see no reason to
> write the code this way.
>
> Are you limiting the strings to 1024 bytes on purpose? I don't see
> any comment mentioning you want to truncate strings.
>
> Would it not be better to do this part using a
> AppendStringInfoString()? and just manually add a '{', ',' or '}' as
> and when required?
>> DatumGetPointer(valout) should really be using DatumGetCString(valout).
>
> Likely you can also use heap_form_tuple. This will save you having to
> convert ints into strings then only to have BuildTupleFromCStrings()
> do the reverse.
>
I agree. I admit all of this is a residue of an initial hackish version
of the function, and should be changed to StringInfo. Will fix.

> 5. individiaul -> individual
> lists. This allows very accurate estimates for individiaul
columns, but
>
> litst -> lists
>
> litst on combinations of columns. Similarly to functional
dependencies
>
Will fix.

> 6. Worth mentioning planning cycles too?
>
> "It's advisable to create <literal>MCV</literal> statistics
objects only
> on combinations of columns that are actually used in conditions
together,
> and for which misestimation of the number of groups is
resulting in bad
> plans. Otherwise, the <command>ANALYZE</command> cycles are
just wasted."
>
Makes sense. Although that's what we say about the existing stats, so
perhaps we should tweak that too.

> 7. straight-forward -> straightforward
>
> (most-common values) lists, a straight-forward extension of the
per-column
> > 8. adresses -> addresses
>
> statistics adresses the limitation by storing individual values, but it
>
Will fix. Thanks for proof-reading.

> 9. Worth mentioning ANALYZE time?
>
> This section introduces multivariate variant of
<acronym>MCV</acronym>
> (most-common values) lists, a straight-forward extension of the
per-column
> statistics described in <xref
linkend="row-estimation-examples"/>. This
> statistics adresses the limitation by storing individual values,
but it
> is naturally more expensive, both in terms of storage and
planning time.
>
Yeah.

> 10. low -> a low
>
> with low number of distinct values. Before looking at the second
query,
>
> 11. them -> then
>
> on items in the <acronym>MCV</acronym> list, and them sums the
frequencies
>
Will fix.

> 12. Should we be referencing the source from the docs?
>
> See <function>mcv_clauselist_selectivity</function>
> in <filename>src/backend/statistics/mcv.c</filename> for details.
>
> hmm. I see it's not the first going by: git grep -E "\w+\.c\<"
> gt
Hmm, that does not return anything to me - do you actually see any
references to .c files in the sgml docs? I agree that probably is not a
good idea, so I'll remove that.

> 13. Pretty minor, but the following loop in
> UpdateStatisticsForTypeChange() could use a break;
>
> attribute_referenced = false;
> for (i = 0; i < staForm->stxkeys.dim1; i++)
> if (attnum == staForm->stxkeys.values[i])
> attribute_referenced = true;
>
> UPDATE: If I'd reviewed the correct patch I'd have seen that you'd
> removed this already
>
;-)

> 14. Again in UpdateStatisticsForTypeChange(), would it not be better
> to do the statext_is_kind_built(oldtup, STATS_EXT_MCV) check before
> checking if the stats contain this column? This gets rid of your
> reset_stats variable.
>
> I also don't quite understand why there's an additional check for
> statext_is_kind_built(oldtup, STATS_EXT_MCV), which if that's false
> then why do we do the dummy update on the tuple?
>
> Have you just coded this so that you can support other stats types
> later without too much modification? If so, I'm finding it a bit
> confusing to read, so maybe it's worth only coding it that way if
> there's more than one stats type to reset for.
>
> UPDATE: If I'd reviewed the correct patch I'd have seen that you'd
> removed this already
;-)

>
> 15. I see you broke out the remainder of the code from
> clauselist_selectivity() into clauselist_selectivity_simple(). The
> comment looks like just a copy and paste from the original. That
> seems like quite a bit of duplication. Is it better to maybe trim down
> the original one?
>
I'll see what I can do.

> 16. I initially didn't see how this code transformed the bms into an
array:
>
> /*
> * Transform the bms into an array, to make accessing i-th member easier,
> * and then construct a filtered version with only attnums referenced
> * by the dependency we validate.
> */
> attnums = build_attnums(attrs);
>
> attnums_dep = (int *)palloc(k * sizeof(int));
> for (i = 0; i < k; i++)
> attnums_dep[i] = attnums[dependency[i]];
>
> Would it be better to name build_attnums() build_attnums_array() ?
>
> I think it would also be better to, instead of saying "the bms", just
> say "attrs".
>
Hmmm, maybe.

> 17. dependencies_clauselist_selectivity(), in:
>
> if ((dependency_is_compatible_clause(clause, rel->relid, &attnum)) &&
> (!bms_is_member(listidx, *estimatedclauses)))
>
> would it be better to have the bms_is_member() first?
>
Yes, that might be a tad faster.

> 18. In dependencies_clauselist_selectivity() there seem to be a new
> bug introduced. We do:
>
> /* mark this one as done, so we don't touch it again. */
> *estimatedclauses = bms_add_member(*estimatedclauses, listidx);
>
> but the bms_is_member() check that skipped these has been removed.
>
> It might be easier to document if we just always do:
>
> if (bms_is_member(listidx, *estimatedclauses))
> continue;
>
> at the start of both loops. list_attnums can just be left unset for
> the originally already estimatedclauses.
>
It's probably not as clear as it should be, but if the clause is already
estimated (or incompatible), then the list_attnums[] entry will be
InvalidAttrNumber. Which is what we check in the second loop.

> 19. in extended_stats.c, should build_attnums() be documented that the
> Bitmapset members are not offset by
> FirstLowInvalidHeapAttributeNumber. I think mostly Bitmapsets of
> Attnums are offset by this, so might be worth a mention.
>
Good point.

> 20. I think bms_member_index() needs documentation. I imagine you'll
> want to mention that the bitmapset must contain the given varattno,
> else surely it'll do the wrong thing if it's not. Perhaps an
> Assert(bms_is_member(keys, varattno)); should be added to it.
>
Agreed. Or maybe make it return -1 in that case? It might even have
missing_ok flag or something like that.

> 21. Comment does not really explain what the function does or what the
> arguments mean:
>
> /*
> * statext_is_compatible_clause_internal
> * Does the heavy lifting of actually inspecting the clauses for
> * statext_is_compatible_clause.
> */
>
Will improve.

> 22. In statext_is_compatible_clause_internal():
>
> /* Var = Const */
>
> The above comment seems a bit misplaced. It looks like the code below
> it is looking for an OpExpr in the form of "Var <op> Const", or "Const
> <op> Var".
>
Yes, I agree.

> 23. statext_is_compatible_clause_internal() you have:
>
> if ((get_oprrest(expr->opno) != F_EQSEL) &&
> (get_oprrest(expr->opno) != F_NEQSEL) &&
> (get_oprrest(expr->opno) != F_SCALARLTSEL) &&
> (get_oprrest(expr->opno) != F_SCALARLESEL) &&
> (get_oprrest(expr->opno) != F_SCALARGTSEL) &&
> (get_oprrest(expr->opno) != F_SCALARGESEL))
> return false;
>
> 6 calls to get_oprrest(). 1 is enough.
>
> How does the existing MCV and histogram stats handle these operators?
> Does it insist on a btree opfamily, or is it as crude as this too?
>
It's this crude too, AFAICS.

> 24. In statext_is_compatible_clause_internal, you have:
>
> /* NOT/AND/OR clause */
> if (or_clause(clause) ||
> and_clause(clause) ||
> not_clause(clause))
> {
> /*
> * AND/OR/NOT-clauses are supported if all sub-clauses are supported
>
> Looks like you were not sure which order to have these, so you just
> tried a few variations :-D Maybe just make them all the same?
>
If you insist ;-)

> 25. Does statext_is_compatible_clause_internal)_ need to skip over
RelabelTypes?
>
I believe it does, based on what I've observed during development. Why
do you think it's not necessary?

> 26. In statext_is_compatible_clause_internal() you mention: /* We only
> support plain Vars for now */, but I see nothing that ensures that
> only Vars are allowed in the is_opclause() condition.
>
> /* see if it actually has the right */
> ok = (NumRelids((Node *) expr) == 1) &&
> (is_pseudo_constant_clause(lsecond(expr->args)) ||
> (varonleft = false,
> is_pseudo_constant_clause(linitial(expr->args))));
>
> the above would allow var+var == const through.
>
But then we call statext_is_compatible_clause_internal on it again, and
that only allows Vars and "Var op Const" expressions. Maybe there's a
way around that?

> The NumRelids seems like it would never have anything > 1 as you have
> a BMS_SINGLETON test on the RestrictInfo where you're calling this
> function from. I think you likely want just a IsA(... , Var) checks
> here, after skipping over RelabelTypes.
> > Not sure what "/* see if it actually has the right */" means.
>
That should have been "right structure" I believe.

> 27. Should the function be named something more related to MCV? The
> name makes it appear fairly generic to extended stats.
>
> * statext_is_compatible_clause
> * Determines if the clause is compatible with MCV lists.
>
No, because it's supposed to also handle histograms (and perhaps other
stats types) in the future.

> 28. This comment seems wrong:
>
> * Currently we only support Var = Const, or Const = Var. It may be
possible
> * to expand on this later.
>
> I see you're allowing IS NULL and IS NOT NULL too. = does not seem to
> be required either.
>
OK, will fix.

> 29. The following fragment makes me think we're only processing
> clauses to use them with MCV lists, but the comment claims "dependency
> selectivity estimations"
>
> /* we're interested in MCV lists */
> int types = STATS_EXT_MCV;
>
> /* check if there's any stats that might be useful for us. */
> if (!has_stats_of_kind(rel->statlist, types))
> return (Selectivity) 1.0;
>
> list_attnums = (Bitmapset **) palloc(sizeof(Bitmapset *) *
> list_length(clauses));
>
> /*
> * Pre-process the clauses list to extract the attnums seen in each item.
> * We need to determine if there's any clauses which will be useful for
> * dependency selectivity estimations. Along the way we'll record all of
>
Yeah, that's copy-pasto.

> 30. Is it better to do the bms_is_member() first here?
>
> if ((statext_is_compatible_clause(clause, rel->relid, &attnums)) &&
> (!bms_is_member(listidx, *estimatedclauses)))
>
> Likely it'll be cheaper.
>
Yeah, same as before.

> 31. I think this comment should be /* Ensure choose_best_statistics()
> didn't mess up */
>
> /* We only understand MCV lists for now. */
> Assert(stat->kind == STATS_EXT_MCV);
>
I'll expand the comment a bit.

> 32. What're lags?
>
> bool *isnull; /* lags of NULL values (up to 32 columns) */
>
Should be "flags" I think.

> 33. "ndimentions"? There's no field in the struct by that name. I'd
> assume it's the same size as the isnull array above it?
>
> Datum *values; /* variable-length (ndimensions) */
>
Yes, that's the case.

> 34. README.mcv
>
> * large -> a large
>
> For columns with large number of distinct values (e.g. those with
continuous
>
> * Is the following up-to-date? I thought I saw code for NOT too?
>
> (a) equality clauses WHERE (a = 1) AND (b = 2)
> (b) inequality clauses WHERE (a < 1) AND (b >= 2)
> (c) NULL clauses WHERE (a IS NULL) AND (b IS NOT NULL)
> (d) OR clauses WHERE (a < 1) OR (b >= 2)
>
> * multi-variate -> multivariate
>
> are large the list may be quite large. This is especially true for
multi-variate
>
> * a -> an
>
> TODO Currently there's no logic to consider building only a MCV list
(and not
>
> * I'd have said "an SRF", but git grep "a SRF" disagrees with me. I
> guess those people must be pronouncing it, somehow!? surf... serf... ?
>
> easier, there's a SRF returning detailed information about the MCV lists.
>
> * Is it better to put a working SQL in here?
>
> SELECT * FROM pg_mcv_list_items(stxmcv);
>
> maybe like:
>
> SELECT s.* FROM pg_statistic_ext, LATERAL pg_mcv_list_items(stxmcv) s;
>
> Maybe with a WHERE clause?
>
> * This list seems outdated.
>
> - item index (0, ..., (nitems-1))
> - values (string array)
> - nulls only (boolean array)
> - frequency (double precision)
>
> base_frequency seems to exist now too.
>
Yeah, those are mostly typos. Will fix.

thanks

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2019-01-17 01:45:20 Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Previous Message Michael Paquier 2019-01-17 01:16:38 Re: [HACKERS] REINDEX CONCURRENTLY 2.0