Quick Links

Re: multivariate statistics (v19)

From:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To:	Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc:	Tatsuo Ishii <ishii(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, david(at)pgmasters(dot)net, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, jeff(dot)janes(at)gmail(dot)com, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: multivariate statistics (v19)
Date:	2016-08-10 04:41:30
Message-ID:	CAB7nPqTU40Q5_NSgHVoMJfbyH1HDtqMbFDJ+kwFJSpam35b3Qg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> 1) enriching the query tree with multivariate statistics info
>
> Right now all the stuff related to multivariate statistics estimation
> happens in clausesel.c - matching condition to statistics, selection of
> statistics to use (if there are multiple usable stats), etc. So pretty much
> all this info is internal to clausesel.c and does not get outside.

This does not seem bad to me as first sight but...

> I'm starting to think that some of the steps (matching quals to stats,
> selection of stats) should happen in a "preprocess" step before the actual
> estimation, storing the information (which stats to use, etc.) in a new type
> of node in the query tree - something like RestrictInfo.
>
> I believe this needs to happen sometime after deconstruct_jointree() as that
> builds RestrictInfos nodes, and looking at planmain.c, right after
> extract_restriction_or_clauses seems about right. Haven't tried, though.
>
> This would move all the "statistics selection" logic from clausesel.c,
> separating it from the "actual estimation" and simplifying the code.
>
> But more importantly, I think we'll need to show some of the data in EXPLAIN
> output. With per-column statistics it's fairly straightforward to determine
> which statistics are used and how. But with multivariate stats things are
> often more complicated - there may be multiple candidate statistics (e.g.
> histograms covering different subsets of the conditions), it's possible to
> apply them in different orders, etc.
>
> But EXPLAIN can't show the info if it's ephemeral and available only within
> clausesel.c (and thrown away after the estimation).

This gives a good reason to not do that in clauserel.c, it would be
really cool to be able to get some information regarding the stats
used with a simple EXPLAIN.

> 2) combining multiple statistics
>
> I think the ability to combine multivariate statistics (covering different
> subsets of conditions) is important and useful, but I'm starting to think
> that the current implementation may not be the correct one (which is why I
> haven't written the SGML docs about this part of the patch series yet).
>
> Assume there's a table "t" with 3 columns (a, b, c), and that we're
> estimating query:
>
> SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3
>
> but that we only have two statistics (a,b) and (b,c). The current patch does
> about this:
>
> P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2)
>
> i.e. it estimates the first two conditions using (a,b), and then estimates
> (c=3) using (b,c) with "b=2" as a condition. Now, this is very efficient,
> but it only works as long as the query contains conditions "connecting" the
> two statistics. So if we remove the "b=2" condition from the query, this
> stops working.

This is trying to make the algorithm smarter than the user, which is
something I'd think we could live without. In this case statistics on
(a,c) or (a,b,c) are missing. And what if the user does not want to
make use of stats for (a,c) because he only defined (a,b) and (b,c)?

Patch 0001: there have been comments about that before, and you have
put the checks on RestrictInfo in a couple of variables of
pull_varnos_walker, so nothing to say from here.

Patch 0002:
+ <para>
+ <command>CREATE STATISTICS</command> will create a new multivariate
+ statistics on the table. The statistics will be created in the in the
+ current database. The statistics will be owned by the user issuing
+ the command.
+ </para>
s/in the/in the/.

+ <para>
+ Create table <structname>t1</> with two functionally dependent columns, i.e.
+ knowledge of a value in the first column is sufficient for detemining the
+ value in the other column. Then functional dependencies are built on those
+ columns:
s/detemining/determining/

+ <para>
+ If a schema name is given (for example, <literal>CREATE STATISTICS
+ myschema.mystat ...</>) then the statistics is created in the specified
+ schema. Otherwise it is created in the current schema. The name of
+ the table must be distinct from the name of any other statistics in the
+ same schema.
+ </para>
I would just assume that a statistics is located on the schema of the
relation it depends on. So the thing that may be better to do is just:
- Register the OID of the table a statistics depends on but not the schema.
- Give up on those query extensions related to the schema.
- Allow the same statistics name to be used for multiple tables.
- Just fail if a statistics name is being reused on the table again.
It may be better to complain about that even if the column list is
different.
- Register the dependency between the statistics and the table.

+ALTER STATISTICS <replaceable class="parameter">name</replaceable>
OWNER TO { <replaceable class="PARAMETER">new_owner</replaceable> |
CURRENT_USER | SESSION_USER }
On the same line, is OWNER TO really necessary? I could have assumed
that if a user is able to query the set of columns related to a
statistics, he should have access to it.

=# create statistics aa_a_b3 on aam (a, b) with (dependencies);
ERROR: 23505: duplicate key value violates unique constraint
"pg_mv_statistic_name_index"
DETAIL: Key (staname, stanamespace)=(aa_a_b3, 2200) already exists.
SCHEMA NAME: pg_catalog
TABLE NAME: pg_mv_statistic
CONSTRAINT NAME: pg_mv_statistic_name_index
LOCATION: _bt_check_unique, nbtinsert.c:433
When creating a multivariate function with a name that already exists,
this error message should be more friendly.

=# create table aa (a int, b int);
CREATE TABLE
=# create view aav as select * from aa;
CREATE VIEW
=# create statistics aab_v on aav (a, b) with (dependencies);
CREATE STATISTICS
Why do views and foreign tables support this command? This code also
mentions that this case is not actually supported:
+ /* multivariate stats are supported on tables and matviews */
+ if (rel->rd_rel->relkind == RELKIND_RELATION ||
+ rel->rd_rel->relkind == RELKIND_MATVIEW)
+ tupdesc = RelationGetDescr(rel);

};

+
/*
Spurious noise in the patch.

+ /* check that at least some statistics were requested */
+ if (!build_dependencies)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("no statistics type (dependencies) was requested")));
So, WITH (dependencies) is mandatory in any case. Why not just
dropping it from the first cut then?

pg_mv_stats shows only the attribute numbers of the columns it has
stats on, I think that those should be the column names. [...after a
while...], as it is mentioned here:
+ * TODO Would be nice if this printed column names (instead of just attnums).

Does this work properly with DDL deparsing? If yes, could it be
possible to add tests in test_ddl_deparse? This is a new object type,
so those look necessary I think.

Statistics definition reorder the columns by itself depending on their
order. For example:
create table aa (a int, b int);
create statistics aas on aa(b, a) with (dependencies);
\d aa
"public.aas" (dependencies) ON (a, b)
As this defines a correlation between multiple columns, isn't it wrong
to assume that (b, a) and (a, b) are always the same correlation? I
don't recall such properties as being always commutative (old
memories, I suck at stats in general). [...reading README...] So this
is caused by the implementation limitations that only limit the
analysis between interactions of two columns. Still it seems incorrect
to reorder the user-visible portion.

The comment on top of get_relation_info needs to be updated to mention
that mvstatlist gets fetched as well.

+ while (HeapTupleIsValid(htup = systable_getnext(indscan)))
+ /* TODO maybe include only already built statistics? */
+ result = insert_ordered_oid(result, HeapTupleGetOid(htup));
I haven't looked at the rest yet of the series yet, but I'd think that
including the ones not built may be a good idea to let caller do
itself more filtering. Of course this depends on the next series...

+typedef struct MVDependencyData
+{
+ int nattributes; /* number of attributes */
+ int16 attributes[1]; /* attribute numbers */
+} MVDependencyData;
You need to look for FLEXIBLE_ARRAY_MEMBER here. Same for MVDependenciesData.

+++ b/src/test/regress/serial_schedule
@@ -167,3 +167,4 @@ test: with
test: xml
test: event_trigger
test: stats
+test: mv_dependencies
This test is not listed in parallel_schedule.

s/Apllying/Applying/

There is a lot of mumbo-jumbo regarding the way dependencies are
stored with mainly serialize_mv_dependencies and
deserialize_mv_dependencies that operates them from bytea/dep trees.
That's not cool and not portable because pg_mv_statistic represents
that as pure bytea. I would suggest creating a generic data type that
does those operations, named like pg_dependency_tree and then use that
in those new catalogs. pg_node_tree is a precedent of such a thing.
New features could as well make use of this new data type of we are
able to design that in a way generic enough, so that would be a base
patch that the current 0002 applies on top of.

Regarding psql:
- The new commands lack psql completion, that would ease the use of
the new commands.
- Would it make sense to have a backslash command to show the list of
statistics?

Congratulations. I just looked at 25% of the overall patch and my mind
is already blown away, but I am catching up with the rest...
--
Michael

In response to

Re: multivariate statistics (v19) at 2016-08-03 01:58:13 from Tomas Vondra

Responses

Re: multivariate statistics (v19) at 2016-08-10 11:33:25 from Tomas Vondra
Re: multivariate statistics (v19) at 2016-08-15 20:50:17 from Tomas Vondra

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Tsunakawa, Takayuki	2016-08-10 04:51:41	Re: Wait events monitoring future development
Previous Message	Dilip Kumar	2016-08-10 04:34:44	Re: [sqlsmith] Failed assertion in joinrels.c