Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Mark Dilger <hornschnorter(at)gmail(dot)com>, Adrien Nayrat <adrien(dot)nayrat(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2018-03-15 00:31:19
Message-ID: 850c3b6b-2641-496c-4efc-500c720f7260@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 03/10/2018 06:19 PM, Alvaro Herrera wrote:
> On 0002:
>
> In terms of docs, I think it's better not to have anything
> user-facing in the README. Consider that users are going to be
> reading the HTML docs only, and many of them may not have the README
> available at all. So anything that could be useful to users must be
> in the XML docs only; keep in the README only stuff that would be
> useful to a developer (a section such as "not yet implemented" would
> belong there, for example). Stuff that's in the XML should not appear
> in the README (because DRY). For the same reason, having the XML docs
> end with "see the README" seems a bad idea to me.
>

I do agree with this in general, but I'm not sure which "user-facing"
bits in the READMEs you mean. I'll go through the docs, but it would be
easier to start with some hints.

> UPDATE_RESULT() is a bit weird to me. I think after staring at it for
> a while it looks okay, but why was it such a shock? In 0002 it's
> only used in one place so I would suggest to have it expanded, but I
> see you use it in 0003 also, three times I think. IMO for clarity it
> seems better to just have the expanded code rather than the macro.
>

I don't quite see why expanding the macro would make the code clearer,
to be honest. I mean, expanding all

UPDATE_RESULT(matches[i], tmp[i], is_or)

calls to

matches[i]
= is_or ? Max(matches[i],tmp[i]) : Min(matches[i], tmp[i]);

does not convey the intent of the code very well, I think. But I'm not
going to fight for it very hard.

That being said, perhaps the name of the macro is not very clear, and
something like MERGE_MATCH would be a better fit.

> find_ext_attnums (and perhaps other places) have references to
> renamed columns, "starelid" and others.
Will fix.

> Also there is this comment:
> /* Prepare to scan pg_statistic_ext for entries having indrelid = this rel. */
> which is outdated since it uses syscache, not a scan. Just remove the
> comment ...
>

Will fix.

> Please add a comment on what does build_attnums() do.
>
> pg_stats_ext_mcvlist_items is odd. I suppose you made it take oid to
> avoid having to deal with a malicious bytea?

That is one reason, yes. The other reason is that we also need to do

getTypeOutputInfo(get_atttype(relid, stakeys->values[i]),
&outfuncs[i], &isvarlena);

so that we can format the MCV items as text. Which means we need
additional information about the extended statistic, so that we can
determine data types. Maybe we could simply store OIDs into the
statistic, similarly to arrays.

That wouldn't solve the issue of malicious values, but maybe we could
make it accept just pg_mcv_list - that should be safe, as casts from
bytea are not supported.

> The query in docs is pretty odd-looking,
>
> SELECT * FROM pg_mcv_list_items((SELECT oid FROM pg_statistic_ext WHERE stxname = 'stts2'));
> If we keep the function as is, I would suggest to use LATERAL instead,
> SELECT m.* FROM pg_statistic_ext, pg_mcv_list_items(oid) m WHERE stxname = 'stts2';
> but seems like it should be more like this instead:
> SELECT m.* FROM pg_statistic_ext, pg_mcv_list_items(stxmcv) m WHERE stxname = 'stts2';
> and not have the output formatting function load the data again from the
> table. It'd be a bit like a type-specific UNNEST.
>

OK, I'll look into that while reviewing the docs.

> There are a few elog(ERROR) messages. The vast majority seem to be just
> internal messages so they're okay, but there is one that should be
> ereport:
>
> + if (total_length > (1024 * 1024))
> + elog(ERROR, "serialized MCV list exceeds 1MB (%ld)", total_length);
> I think we have some precedent for better wording, such as
> errmsg("index row size %zu exceeds maximum %zu for index \"%s\""
> so I would say
> errmsg("serialized MCV list size %zu exceedes maximum %zu" )
> though I wonder when is this error thrown -- if this is detected during
> analyze for example, what happens?
>

Actually, do we need/want to enforce such limit? It seemed like a good
idea back then, but perhaps having a limit with a mostly arbitrary value
is not such a great idea after all.

> There is this FIXME:
> + * FIXME Should skip already estimated clauses (using the estimatedclauses
> + * bitmap).
> Are you planning on implementing this before commit?
>

Actually, in the MCV patch this is not really needed, because it gets
applied before functional dependencies (and those do skip already
estimated clauses).

Moreover the 0003 patch (histograms) reworks this part of the code a bit
(because MCV and histograms are somewhat complementary). So I think this
shouldn't really be a FIXME, but more a comment "We're not handling
this, because it's not needed."

But let me look at this a bit - it might make sense to move some of the
code from 0003 to 0002, which would fix this limitation, of course.

> There are other FIXMEs also. This in particular caught my attention:
>
> + /* merge the bitmap into the existing one */
> + for (i = 0; i < mcvlist->nitems; i++)
> + {
> + /*
> + * Merge the result into the bitmap (Min for AND, Max for OR).
> + *
> + * FIXME this does not decrease the number of matches
> + */
> + UPDATE_RESULT(matches[i], or_matches[i], is_or);
> + }
>
> We come back to UPDATE_RESULT again ... and note how the comment makes
> no sense unless you know what UPDATE_RESULT does internally. This is
> one more indication that the macro is not a great thing to have. Let's
> lose it. But while at it, what to do about the FIXME?
>

Hmmm, not sure that's really a fault of the UPDATE_RESULT macro.

Sorting out the FIXME should not be difficult, I think - just remember
the original values of matches[i], and update the number of matches if
it gets flipped from true to false.

> You also have this
> + /* XXX syscache contains OIDs of deleted stats (not invalidated) */
> + if (!HeapTupleIsValid(htup))
> + return NULL;
> but what does it mean? Is it here to cover for some unknown bug?
> Should we maybe not have this at all?
>

Yeah, that's a bogus/obsolete FIXME, from before we had proper cache
invalidations in RemoveStatistics I think.

> Another XXX comment says
> + * XXX All the memory is allocated in a single chunk, so that the caller
> + * can simply pfree the return value to release all of it.
>
> but I would say just remove the XXX and leave the rest of the comment.
>

OK, makes sense.

> There is another XXX comment that says "this is useless", and I agree.
> Just take it all out ...
>

You mean the check with UINT16_MAX assert? Yeah, I'll get rid of that.

Thanks for the review!

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2018-03-15 00:41:03 Re: User defined data types in Logical Replication
Previous Message Andres Freund 2018-03-15 00:20:17 Re: JIT compiling with LLVM v11