Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Mark Dilger <hornschnorter(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2019-01-16 12:56:47
Message-ID: CAKJS1f9Jh8m5M1hSuga_q2YThX862XFm7ytmf7z-r+JfzossEg@mail.gmail.com
Lists: pgsql-hackers

On Tue, 15 Jan 2019 at 08:21, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> Turns out you were right - the attribute_referenced piece was quite
> unnecessary. So I've removed it. I've also extended the regression tests
> to verify changing type of another column does not reset the stats.

(Trying to find my feet over here)

I've read over the entire thread, and apart from missing the last two
emails and therefore the latest patch, I managed to read over most of
the MCV patch. I didn't quite get to reading mcv.c and don't quite
have the energy to take that on now.

At this stage I'm trying to get to know the patch. I read a lot of
discussion between you and Dean ironing out how the stats should be
used to form selectivities. At the time I'd not read the patch yet,
so most of it went over my head.

I did note down a few things on my read. I've included them below.
Hopefully, they're useful.

MCV list review

1. In mcv.c there's Assert(ndistinct <= UINT16_MAX); this should be
PG_UINT16_MAX.

2. math.h should be included just after postgres.h
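
i.e. the usual pattern:

#include "postgres.h"

#include <math.h>

/* ... followed by the rest of the includes */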

3. Copyright is still -2017 in mcv.c. Hopefully, if you change it to
2019, you'll never have to bump it ever again! :-)

4. Looking at pg_stats_ext_mcvlist_items() I see you've coded the
string building manually. I find the way it's coded a little
strange. It means the copying becomes quadratic due to:

snprintf(buff, 1024, format, values[1], DatumGetPointer(valout));
strncpy(values[1], buff, 1023);

So basically, here you're building a new string with values[1]
followed by a comma, followed by valout. On the next
line you then copy that new buffer back into values[1]. I understand
this part is likely not performance critical, but I see no reason to
write the code this way.

Are you limiting the strings to 1024 bytes on purpose? I don't see
any comment mentioning you want to truncate strings.

Would it not be better to do this part using
appendStringInfoString(), and just manually add a '{', ',' or '}' as
and when required?

DatumGetPointer(valout) should really be using DatumGetCString(valout).

Likely you can also use heap_form_tuple. This will save you having to
convert ints into strings only to have BuildTupleFromCStrings()
do the reverse.
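
Something like the following is roughly what I had in mind for the
StringInfo approach (just a sketch -- the per-item names like "outfuncs"
and "item->values" are made up, so adjust to whatever the patch uses):

StringInfoData buf;

initStringInfo(&buf);
appendStringInfoChar(&buf, '{');

for (i = 0; i < mcvlist->ndimensions; i++)
{
	char	   *valstr = OutputFunctionCall(&outfuncs[i], item->values[i]);

	if (i > 0)
		appendStringInfoChar(&buf, ',');
	appendStringInfoString(&buf, valstr);
}

appendStringInfoChar(&buf, '}');
values[1] = buf.data;		/* no fixed-size buffer, no quadratic copying */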

5. individiaul -> individual
lists. This allows very accurate estimates for individiaul columns, but

litst -> lists

litst on combinations of columns. Similarly to functional dependencies

6. Worth mentioning planning cycles too?

"It's advisable to create <literal>MCV</literal> statistics objects only
on combinations of columns that are actually used in conditions together,
and for which misestimation of the number of groups is resulting in bad
plans. Otherwise, the <command>ANALYZE</command> cycles are just wasted."

7. straight-forward -> straightforward

(most-common values) lists, a straight-forward extension of the per-column

8. adresses -> addresses

statistics adresses the limitation by storing individual values, but it

9. Worth mentioning ANALYZE time?

This section introduces multivariate variant of <acronym>MCV</acronym>
(most-common values) lists, a straight-forward extension of the per-column
statistics described in <xref linkend="row-estimation-examples"/>. This
statistics adresses the limitation by storing individual values, but it
is naturally more expensive, both in terms of storage and planning time.

10. low -> a low

with low number of distinct values. Before looking at the second query,

11. them -> then

on items in the <acronym>MCV</acronym> list, and them sums the frequencies

12. Should we be referencing the source from the docs?

See <function>mcv_clauselist_selectivity</function>
in <filename>src/backend/statistics/mcv.c</filename> for details.

Hmm, I see it's not the first one to do so, going by: git grep -E "\w+\.c\<"

13. Pretty minor, but the following loop in
UpdateStatisticsForTypeChange() could use a break;

attribute_referenced = false;
for (i = 0; i < staForm->stxkeys.dim1; i++)
	if (attnum == staForm->stxkeys.values[i])
		attribute_referenced = true;

UPDATE: If I'd reviewed the correct patch I'd have seen that you'd
removed this already

14. Again in UpdateStatisticsForTypeChange(), would it not be better
to do the statext_is_kind_built(oldtup, STATS_EXT_MCV) check before
checking if the stats contain this column? This gets rid of your
reset_stats variable.

I also don't quite understand why there's an additional check for
statext_is_kind_built(oldtup, STATS_EXT_MCV); if that's false,
why do we do the dummy update on the tuple at all?

Have you just coded this so that you can support other stats types
later without too much modification? If so, I'm finding it a bit
confusing to read, so maybe it's worth only coding it that way if
there's more than one stats type to reset for.

UPDATE: If I'd reviewed the correct patch I'd have seen that you'd
removed this already

15. I see you broke out the remainder of the code from
clauselist_selectivity() into clauselist_selectivity_simple(). The
comment looks like just a copy and paste from the original. That
seems like quite a bit of duplication. Is it better to maybe trim down
the original one?

16. I initially didn't see how this code transformed the bms into an array:

/*
 * Transform the bms into an array, to make accessing i-th member easier,
 * and then construct a filtered version with only attnums referenced
 * by the dependency we validate.
 */
attnums = build_attnums(attrs);

attnums_dep = (int *) palloc(k * sizeof(int));
for (i = 0; i < k; i++)
	attnums_dep[i] = attnums[dependency[i]];

Would it be better to name build_attnums() build_attnums_array() ?

I think it would also be better to, instead of saying "the bms", just
say "attrs".

17. dependencies_clauselist_selectivity(), in:

if ((dependency_is_compatible_clause(clause, rel->relid, &attnum)) &&
	(!bms_is_member(listidx, *estimatedclauses)))

would it be better to have the bms_is_member() first?

18. In dependencies_clauselist_selectivity() there seems to be a new
bug introduced. We do:

/* mark this one as done, so we don't touch it again. */
*estimatedclauses = bms_add_member(*estimatedclauses, listidx);

but the bms_is_member() check that skipped these has been removed.

It might be easier to document if we just always do:

if (bms_is_member(listidx, *estimatedclauses))
	continue;

at the start of both loops. list_attnums can just be left unset for
the clauses that were already in estimatedclauses.
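
i.e. roughly this shape in both loops (just a sketch, with the rest of the
loop body omitted):

listidx = -1;
foreach(l, clauses)
{
	listidx++;

	/* skip anything that has already been estimated */
	if (bms_is_member(listidx, *estimatedclauses))
		continue;

	/* ... */
}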

19. In extended_stats.c, should build_attnums() be documented to say
that the Bitmapset members are not offset by
FirstLowInvalidHeapAttributeNumber? I think most Bitmapsets of
attnums are offset by this, so it might be worth a mention.

20. I think bms_member_index() needs documentation. I imagine you'll
want to mention that the bitmapset must contain the given varattno,
else surely it'll do the wrong thing. Perhaps an
Assert(bms_is_member(varattno, keys)); should be added to it.
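
Something along these lines is what I had in mind for the header comment
(the wording is only my guess at what the function does):

/*
 * bms_member_index
 *		Return the 0-based position of 'varattno' among the members of 'keys'.
 *
 * 'varattno' must be present in 'keys', otherwise the result is meaningless.
 */

with the Assert at the top of the function body.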

21. Comment does not really explain what the function does or what the
arguments mean:

/*
 * statext_is_compatible_clause_internal
 *		Does the heavy lifting of actually inspecting the clauses for
 *		statext_is_compatible_clause.
 */
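
Maybe something more along these lines (I'm only guessing at the specifics
here, so please adjust to what the function actually does):

/*
 * statext_is_compatible_clause_internal
 *		Check if 'clause' is a clause we know how to estimate using extended
 *		statistics, collecting the attnums it references into '*attnums'.
 *
 * Does the heavy lifting for statext_is_compatible_clause, and may recurse
 * into the sub-clauses of AND/OR/NOT clauses.
 */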

22. In statext_is_compatible_clause_internal():

/* Var = Const */

The above comment seems a bit misplaced. It looks like the code below
it is looking for an OpExpr in the form of "Var <op> Const", or "Const
<op> Var".

23. In statext_is_compatible_clause_internal() you have:

if ((get_oprrest(expr->opno) != F_EQSEL) &&
	(get_oprrest(expr->opno) != F_NEQSEL) &&
	(get_oprrest(expr->opno) != F_SCALARLTSEL) &&
	(get_oprrest(expr->opno) != F_SCALARLESEL) &&
	(get_oprrest(expr->opno) != F_SCALARGTSEL) &&
	(get_oprrest(expr->opno) != F_SCALARGESEL))
	return false;

6 calls to get_oprrest(). 1 is enough.
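
e.g. something like this (just a sketch):

RegProcedure oprrest = get_oprrest(expr->opno);

switch (oprrest)
{
	case F_EQSEL:
	case F_NEQSEL:
	case F_SCALARLTSEL:
	case F_SCALARLESEL:
	case F_SCALARGTSEL:
	case F_SCALARGESEL:
		break;				/* supported, carry on */
	default:
		return false;
}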

How do the existing MCV and histogram stats handle these operators?
Do they insist on a btree opfamily, or are they as crude as this too?

24. In statext_is_compatible_clause_internal, you have:

/* NOT/AND/OR clause */
if (or_clause(clause) ||
	and_clause(clause) ||
	not_clause(clause))
{
	/*
	 * AND/OR/NOT-clauses are supported if all sub-clauses are supported

Looks like you were not sure which order to have these, so you just
tried a few variations :-D Maybe just make them all the same?

25. Does statext_is_compatible_clause_internal() need to skip over RelabelTypes?

26. In statext_is_compatible_clause_internal() you mention: /* We only
support plain Vars for now */, but I see nothing that ensures that
only Vars are allowed in the is_opclause() condition.

/* see if it actually has the right */
ok = (NumRelids((Node *) expr) == 1) &&
	(is_pseudo_constant_clause(lsecond(expr->args)) ||
	 (varonleft = false,
	  is_pseudo_constant_clause(linitial(expr->args))));

the above would allow var+var == const through.

The NumRelids() check seems like it would never find anything > 1, as
you have a BMS_SINGLETON test on the RestrictInfo where you're calling
this function from. I think you likely just want IsA(..., Var) checks
here, after skipping over any RelabelTypes.
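
Roughly what I had in mind (untested sketch; the variable names are just
for illustration):

Node	   *left = (Node *) linitial(expr->args);
Node	   *right = (Node *) lsecond(expr->args);

/* strip any binary-compatible relabeling */
if (IsA(left, RelabelType))
	left = (Node *) ((RelabelType *) left)->arg;
if (IsA(right, RelabelType))
	right = (Node *) ((RelabelType *) right)->arg;

ok = (IsA(left, Var) && is_pseudo_constant_clause(right)) ||
	(varonleft = false,
	 (IsA(right, Var) && is_pseudo_constant_clause(left)));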

Not sure what "/* see if it actually has the right */" means.

27. Should the function be named something more related to MCV? The
name makes it appear fairly generic to extended stats.

* statext_is_compatible_clause
* Determines if the clause is compatible with MCV lists.

28. This comment seems wrong:

* Currently we only support Var = Const, or Const = Var. It may be possible
* to expand on this later.

I see you're allowing IS NULL and IS NOT NULL too. = does not seem to
be required either.

29. The following fragment makes me think we're only processing
clauses to use them with MCV lists, but the comment claims "dependency
selectivity estimations"

/* we're interested in MCV lists */
int types = STATS_EXT_MCV;

/* check if there's any stats that might be useful for us. */
if (!has_stats_of_kind(rel->statlist, types))
	return (Selectivity) 1.0;

list_attnums = (Bitmapset **) palloc(sizeof(Bitmapset *) *
									 list_length(clauses));

/*
 * Pre-process the clauses list to extract the attnums seen in each item.
 * We need to determine if there's any clauses which will be useful for
 * dependency selectivity estimations. Along the way we'll record all of

30. Is it better to do the bms_is_member() first here?

if ((statext_is_compatible_clause(clause, rel->relid, &attnums)) &&
	(!bms_is_member(listidx, *estimatedclauses)))

Likely it'll be cheaper.

31. I think this comment should be /* Ensure choose_best_statistics()
didn't mess up */

/* We only understand MCV lists for now. */
Assert(stat->kind == STATS_EXT_MCV);

32. What're lags?

bool *isnull; /* lags of NULL values (up to 32 columns) */

33. "ndimentions"? There's no field in the struct by that name. I'd
assume it's the same size as the isnull array above it?

Datum *values; /* variable-length (ndimensions) */

34. README.mcv

* large -> a large

For columns with large number of distinct values (e.g. those with continuous

* Is the following up-to-date? I thought I saw code for NOT too?

(a) equality clauses WHERE (a = 1) AND (b = 2)
(b) inequality clauses WHERE (a < 1) AND (b >= 2)
(c) NULL clauses WHERE (a IS NULL) AND (b IS NOT NULL)
(d) OR clauses WHERE (a < 1) OR (b >= 2)

* multi-variate -> multivariate

are large the list may be quite large. This is especially true for multi-variate

* a -> an

TODO Currently there's no logic to consider building only a MCV list (and not

* I'd have said "an SRF", but git grep "a SRF" disagrees with me. I
guess those people must be pronouncing it somehow!? surf... serf...?

easier, there's a SRF returning detailed information about the MCV lists.

* Is it better to put a working SQL query in here?

SELECT * FROM pg_mcv_list_items(stxmcv);

maybe like:

SELECT s.* FROM pg_statistic_ext, LATERAL pg_mcv_list_items(stxmcv) s;

Maybe with a WHERE clause?

* This list seems outdated.

- item index (0, ..., (nitems-1))
- values (string array)
- nulls only (boolean array)
- frequency (double precision)

base_frequency seems to exist now too.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
