Re: multivariate statistics (v25)

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, David Fetter <david(at)fetter(dot)org>, dean(dot)a(dot)rasheed(at)gmail(dot)com, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multivariate statistics (v25)
Date: 2017-03-13 10:00:57
Message-ID: CAKJS1f8yq7r28OhjuJrqFg4EpdhQ1L9LRnNHhec5QCYrofVEVQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 3 March 2017 at 03:53, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:

> This time with the attachments ....

It's been a long while since I looked at this patch, but I'm now taking
another look.

I've made a list of stuff I've found from making my first pass on 0001 and
0002. Some of the stuff may seem a little pedantic, so apologies about
those ones. I merely SET nit_picking_threshold TO 0; and reviewed.

Here goes:

0001:

+ RestrictInfo *rinfo = (RestrictInfo*)node;

and

+ RestrictInfo *rinfo = (RestrictInfo *)node;
+
+ return expression_tree_walker((Node*)rinfo->clause,
+ pull_varattnos_walker,
+ (void*) context);

spacing incorrect. Please space after type name in casts and after the
closing parenthesis.

0002:

+ dropped as well. Multivariate statistics referencing the column will
+ be dropped only if there would remain a single non-dropped column.

I was initially confused by this. I think it should worded as:

"Multivariate statistics referencing the dropped column will also be
removed if the removal of the column would cause the statistics to contain
data for only a single column"

I had been confused as I'd been thinking of dropping multiple columns at
once with the same command, and only 1 column remained in the table. So I
think it's best to clarify you mean the statistic here.

+ OCLASS_STATISTICS /* pg_statistics_ext */

I wonder if this should be named: OCLASS_STATISTICEXT. The comment is also
incorrect and should read "pg_statistic_ext" (without 's')

I tried to perform a test in this area and received an error:

postgres=# create table ab1 (a int, b int);
CREATE TABLE
postgres=# create statistics ab1_a_b_stats on (a,b) from ab1;
CREATE STATISTICS
postgres=# alter table ab1 drop column a;
ALTER TABLE
postgres=# drop table ab1;
ERROR: cache lookup failed for statistics 16399

+ When estimating conditions on multiple columns, the planner assumes
+ independence of the conditions and multiplies the selectivities. When
the
+ columns are correlated, the independence assumption is violated, and the
+ estimates may be off by several orders of magnitude, resulting in poor
+ plan choices.

I don't think the assumption is violated. We still assume that they're
independent, which is incorrect. Nothing gets violated.

Perhaps it would be more accurate to write:

"When estimating the selectivity of conditions over multiple columns, the
planner normally assumes each condition is independent of other conditions,
and simply multiplies the selectivity estimates of each condition together
to produce a final selectivity estimation for all conditions. This method
can often lead to inaccurate row estimations when the conditions have
dependencies on one another. Such misestimations can result poor plan
choices being made."

+ using <command>CREATE STATISTICS</> command.

using the ...

+ As explained in <xref linkend="planner-stats">, the planner can
determine
+ cardinality of <structname>t</structname> using the number of pages and
+ rows is looked up in <structname>pg_class</structname>:

perhaps "rows is" should become "rows as" or "rows which are".

+ * delete multi-variate statistics
+ */
+ RemoveStatisticsExt(relid, 0);

I think it should be "delete extended statistics"

Should this not be rejected?

postgres=# create view v1 as select 1 a, 2 b;
CREATE VIEW
postgres=# create statistics v1_a_stats on (a,b) from v1;
CREATE STATISTICS

and this?

postgres=# create sequence test_seq;
CREATE SEQUENCE
postgres=# select * from test_seq;
last_value | log_cnt | is_called
------------+---------+-----------
1 | 0 | f
(1 row)
postgres=# create statistics test_seq_stats on (last_value,log_cnt) from
test_seq;
CREATE STATISTICS

The patch does claim:

+ /* extended stats are supported on tables and matviews */

So I guess it should be disallowed.

+ /* OBJECT_STATISTICS */
+ {
+ "statistics", OBJECT_STATISTICS

Maybe this should be changed to be OBJECT_STATISTICEXT */. Doing it this
way would close the door a bit on pg_depends records existing for
pg_statistic.

A quick test shows a problem here:

postgres=# create table ab (a int, b int);
CREATE TABLE
postgres=# create statistics ab_a_b_stats on (a,b) from ab;
CREATE STATISTICS
postgres=# create statistics ab_a_b_stats1 on (a,b) from ab;
CREATE STATISTICS
postgres=# alter statistics ab_a_b_stats1 rename to ab_a_b_stats;
ERROR: unsupported object class 3381

+/*****************************************************************************
+ *
+ * QUERY :
+ * CREATE STATISTICS stats_name ON relname (columns) WITH (options)
+ *
+
*****************************************************************************/

Old Syntax?

+ $$ = (Node *)n;

Incorrect spacing.

+ * The returned list is guaranteed to be sorted in order by OID, although
+ * this is not currently needed.

hmm, whats the tie-breaker going to be for:

CREATE TABLE abc (a int, b int, c int);
create statistics abc_ab_stats (a,b) from abc;
create statistics abc_bc_stats (b,c) from abc;

select * from abc where a=1 and b=1 and c=1;

I've not gotten to that part of the code yet, but reading the comment made
me wonder how you're handling this. I think predictable is a good way, so
that would require some ordering on this list... I presume.

+ * happen if the statistics has fewer attributes than we have Vars.

"statistics" is plural, so "has" should be "have"

although I see you mix the plurals up a few lines later and write in
singular form.

+ /* check that all Vars are covered by the statistic */

This one is more of a question:

+ bool found;
+ double ndist = find_ndistinct(root, rel, varinfos, &found);

would it be better to return the bool and pass the &ndist here? That way
you could simply write:

if (!find_ndistinct(root, rel, varinfos, &reldistinct))
clamp *= 0.1;

@@ -3450,6 +3467,7 @@ estimate_num_groups(PlannerInfo *root, List
*groupExprs, double input_rows,
clamp = rel->tuples;
}
}
+

Adds a new line by mistake.

+ /*
+ * Only ndistinct stats covering all Vars are acceptable, which can't
+ * happen if the statistics has fewer attributes than we have Vars.
+ */
+ if (bms_num_members(attnums) > info->stakeys->dim1)
+ continue;

bms_num_members() done inside loop. Would you say it's OK to assume the
compiler will do that before the loop?, or do you think it's best to set it
before looping? We already know we're going to loop at least once, since
we'd have short circuited at the start of the function otherwise.

+ k = -1;
+ while ((k = bms_next_member(attnums, k)) >= 0)
+ {
+ bool attr_found = false;
+ for (i = 0; i < info->stakeys->dim1; i++)
+ {
+ if (info->stakeys->values[i] == k)
+ {
+ attr_found = true;
+ break;
+ }
+ }
+
+ /* found attribute not covered by this ndistinct stats, skip */
+ if (!attr_found)
+ {
+ matches = false;
+ break;
+ }
+ }

Would it be better just to stuff info->stakeys->values into a bitmapset and
check its a subset of attnums? It would mean allocating memory in the loop,
so maybe you think otherwise, but in that case maybe StatisticExtInfo
should store the bitmapset?

+ if (! matches)
+ continue;

extra whitespace after !

+ /* not the right item (different number of attributes) */
+ if (item->nattrs != bms_num_members(attnums))
+ continue;

again using bms_num_members() inside a loop when its known before the loop.

+ Assert(!(*found));

This confused me for a minute as I mistakenly read this as
Assert((*found)); can you comment this to say something along the lines of
the fact that we should have returned already if we found a match.

+ appendPQExpBuffer(&buf, "(dependencies)");

I think it's better practice to use appendPQExpBufferStr() when there's no
formatting. It'll perform marginally better, which might not be important
here, but it sets a better example for people to follow when performance is
more critical.

+ List *keys; /* String nodes naming referenced column(s) */

column(s) should read columns. 's' is not optional.

+ bool rd_statvalid; /* state of rd_statlist: true/false */

so bool can only be true or false. Good to know ;-) the comment is
probably useless, can you improve?

+ change the definition of a extended statistics

"a" should be "an", Also is statistics plural here. It's commonly mixed up
in the patch. I think it needs standardised. I personally think if you're
speaking of a single pg_statatic_ext row, then it should be singular. Yet,
I'm aware you're using plural for the CREATE STATISTICS command, to me that
feels a bit like: CREATE TABLES mytable (); am I somehow thinking wrongly
somehow here?

+ The name (optionally schema-qualified) of a statistics to be
altered.

"a" should be "the"

+ If a schema name is given (for example, <literal>CREATE STATISTICS
+ myschema.mystat ...</>) then the statistics is created in the specified
+ schema. Otherwise it is created in the current schema. The name of

What's created in the current schema? I thought this was just for naming?

+ <para>
+ To be able to create a table, you must have <literal>USAGE</literal>
+ privilege on all column types or the type in the <literal>OF</literal>
+ clause, respectively.
+ </para>

"create a table" ? create an extended statistic ?

+ <title>Examples</title>
+
+ <para>
+ ...
+ </para>

Why are the examples missing? I've not looked beyond patch 0002 yet, but
I'd have assumed 0002 should be commitable without requiring later patches
to make it correct.

+ * statscmds.c
+ * Commands for creating and altering extended statistics
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California

2017.

+ * statistics might work with equality only.

extra space

+ /* costruction of array of enabled statistic */

construction?

+ atttuple = SearchSysCacheAttName(relid, attname);

+

+ if (!HeapTupleIsValid(atttuple))

+ ereport(ERROR,

+ (errcode(ERRCODE_UNDEFINED_COLUMN),

+ errmsg("column \"%s\" referenced in statistics does not exist",

+ attname)));

+

+ /* more than STATS_MAX_DIMENSIONS columns not allowed */

+ if (numcols >= STATS_MAX_DIMENSIONS)

+ ereport(ERROR,

+ (errcode(ERRCODE_TOO_MANY_COLUMNS),

+ errmsg("cannot have more than %d keys in statistics",

+ STATS_MAX_DIMENSIONS)));

+

+ attnums[numcols] = ((Form_pg_attribute) GETSTRUCT(atttuple))->attnum;

+ ReleaseSysCache(atttuple);

Looks like a syscache leak. No?

+ /*
+ * Delete the pg_proc tuple.
+ */
+ relation = heap_open(StatisticExtRelationId, RowExclusiveLock);

pg_proc?

+ * pg_statistic_ext.h
+ * definition of the system "extended statistic" relation
(pg_statistic_ext)
+ * along with the relation's initial contents.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group

2017

+ * stats.h
+ * Multivariate statistics and selectivity estimation functions.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group

2017

"Multivariate" should be "Extended". My justification here is
that stats_are_built() is contained within, which is used
in get_relation_statistics() which is not specific to MV stats.

0003:

No more time today. Will try and get to those soon.

Setting to waiting on author in the meantime.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Anastasia Lubennikova 2017-03-13 10:01:08 Re: IF NOT EXISTS option for CREATE SERVER and CREATE USER MAPPING statements
Previous Message Dagfinn Ilmari =?utf-8?Q?Manns=C3=A5ker?= 2017-03-13 09:35:39 Re: make check-world output