Re: multivariate statistics (v19)

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Tatsuo Ishii <ishii(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, david(at)pgmasters(dot)net, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, jeff(dot)janes(at)gmail(dot)com, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multivariate statistics (v19)
Date: 2016-08-10 18:34:58
Message-ID: b59fc9cc-cee4-a7b8-461c-3da4af3e8691@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 08/10/2016 02:23 PM, Michael Paquier wrote:
> On Wed, Aug 10, 2016 at 8:33 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> On 08/10/2016 06:41 AM, Michael Paquier wrote:
>>> Patch 0001: there have been comments about that before, and you have
>>> put the checks on RestrictInfo in a couple of variables of
>>> pull_varnos_walker, so nothing to say from here.
>>>
>>
>> I don't follow. Are you suggesting 0001 is a reasonable fix, or that there's
>> a proposed solution?
>
> I think that's reasonable.
>

Well, to me the 0001 feels more like a temporary workaround rather than
a proper solution. I just don't know how to deal with it so I've kept it
for now. Pretty sure there will be complaints that adding RestrictInfo
to the expression walkers is not a nice idea.

>> ...
>>
>> The idea is that the syntax should work even for statistics built on
>> multiple tables, e.g. to provide better statistics for joins. That's why the
>> schema may be specified (as each table might be in different schema), and so
>> on.
>
> So you mean that the same statistics could be shared between tables?
> But as this is visibly not a concept introduced yet in this set of
> patches, why not just cut it off for now to simplify the whole? If
> there is no schema-related field in pg_mv_statistics we could still
> add it later if it proves to be useful.
>

Yes, I think creating statistics on multiple tables is one of the
possible future directions. One of the previous patch versions
introduced ALTER TABLE ... ADD STATISTICS syntax, but that ran into
issues in gram.y, and given the multi-table possibilities the CREATE
STATISTICS seems like a much better idea anyway.

But I guess you're right we may make this a bit more strict now, and
relax it in the future if needed. For example as we only support
single-table statistics at this point, we may remove the schema and
always create the statistics in the schema of the table.

But I don't think we should make the statistics names unique only within
a table (instead of within the schema).

The difference between those two cases is that if we allow multi-table
statistics in the future, we can simply allow specifying the schema and
everything will work just fine. But it'd break the second case, as it
might result in conflicts in existing schemas.

I do realize this might be seen as a case of "future proofing" based on
dubious predictions of how something might work, but OTOH this (schema
inherited from table, unique within a schema) is pretty consistent with
how this work for indexes.

>>> +
>>> /*
>>> Spurious noise in the patch.
>>>
>>> + /* check that at least some statistics were requested */
>>> + if (!build_dependencies)
>>> + ereport(ERROR,
>>> + (errcode(ERRCODE_SYNTAX_ERROR),
>>> + errmsg("no statistics type (dependencies) was
>>> requested")));
>>> So, WITH (dependencies) is mandatory in any case. Why not just
>>> dropping it from the first cut then?
>>
>>
>> Because the follow-up patches extend this to require at least one statistics
>> type. So in 0004 it becomes
>>
>> if (!(build_dependencies || build_mcv))
>>
>> and in 0005 it's
>>
>> if (!(build_dependencies || build_mcv || build_histogram))
>>
>> We might drop it from 0002 (and assume build_dependencies=true), and then
>> add the check in 0004. But it seems a bit pointless.
>
> This is a complicated set of patches. I'd think that we should try to
> simplify things as much as possible first, and the WITH clause is not
> mandatory to have as of 0002.
>

OK, I can remove the WITH from the 0002 part. Not a big deal.

>>> Statistics definition reorder the columns by itself depending on their
>>> order. For example:
>>> create table aa (a int, b int);
>>> create statistics aas on aa(b, a) with (dependencies);
>>> \d aa
>>> "public.aas" (dependencies) ON (a, b)
>>> As this defines a correlation between multiple columns, isn't it wrong
>>> to assume that (b, a) and (a, b) are always the same correlation? I
>>> don't recall such properties as being always commutative (old
>>> memories, I suck at stats in general). [...reading README...] So this
>>> is caused by the implementation limitations that only limit the
>>> analysis between interactions of two columns. Still it seems incorrect
>>> to reorder the user-visible portion.
>>
>> I don't follow. If you talk about Pearson's correlation, that clearly does
>> not depend on the order of columns - it's perfectly independent of that. If
>> you talk about about correlation in the wider sense (i.e. arbitrary
>> dependence between columns), that might depend - but I don't remember a
>> single piece of the patch where this might be a problem.
>
> Yes, based on what is done in the patch that may not be a problem, but
> I am wondering if this is not restricting things too much.
>

Let's keep the code as it is. If we run into this issue in the future,
we can easily relax this - there's nothing depending on the ordering of
attnums, IIRC.

>> Also, which README states that we can only analyze interactions between two
>> columns? That's pretty clearly not the case - the patch should handle
>> dependencies between more columns without any problems.
>
> I have noticed that the patch evaluates all the set of permutations
> possible using a column list, it seems to me though that say if we
> have three columns (a,b,c) listed in a statistics, (a,b) => c and
> (b,a) => c are two different things.
>

Yes, those are two different functional dependencies, of course. But the
algorithm (during ANALYZE) should discover all of them, and even the
examples are using three columns, so I'm not sure what you mean by
"analyze interactions between two columns"?

>>> There is a lot of mumbo-jumbo regarding the way dependencies are
>>> stored with mainly serialize_mv_dependencies and
>>> deserialize_mv_dependencies that operates them from bytea/dep trees.
>>> That's not cool and not portable because pg_mv_statistic represents
>>> that as pure bytea. I would suggest creating a generic data type that
>>> does those operations, named like pg_dependency_tree and then use that
>>> in those new catalogs. pg_node_tree is a precedent of such a thing.
>>> New features could as well make use of this new data type of we are
>>> able to design that in a way generic enough, so that would be a base
>>> patch that the current 0002 applies on top of.
>>
>>
>> Interesting idea, haven't thought about that. So are you suggesting to add a
>> data type for each statistics type (dependencies, MCV, histogram, ...)?
>
> Yes that would be something like that, it would be actually perhaps
> better to have one single data type, and be able to switch between
> each model easily instead of putting byteas in the catalog.

Hmmm, not sure about that. For example what about combinations of
statistics - e.g. when we have MCV list on the most common values and a
histogram on the rest? Should we store both as a single value, or would
that be in two separate values, or what?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kevin Grittner 2016-08-10 18:36:54 Re: Proposal for CSN based snapshots
Previous Message Peter Eisentraut 2016-08-10 18:24:35 Re: pg_ctl promote wait