Quick Links

Re: multivariate statistics v14

From:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To:	Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc:	Tatsuo Ishii <ishii(at)postgresql(dot)org>, horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp, jeff(dot)janes(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: multivariate statistics v14
Date:	2016-03-21 03:34:40
Message-ID:	20160321033440.GA350186@alvherre.pgsql
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Another skim on 0002:

reference.sgml is missing a call to &alterStatistic.

ObjectProperty[] contains a comment that the ACL is "same as relation",
but is that still correct, given that now stats may be related to more
than one relation? Do we even know what the rules for ACLs on
cross-relation stats are? One very simple way to get around this is to
dictate that all the rels must have the same owner. Perhaps we're not
considering the multi-relation case yet?

We have this FIXME comment in do_analyze_rel:

+ * FIXME This sample sizing is mostly OK when computing stats for
+ * individual columns, but when computing multi-variate stats
+ * for multivariate stats (histograms, mcv, ...) it's rather
+ * insufficient. For stats on multiple columns / complex stats
+ * we need larger sample sizes, because we need to build more
+ * detailed stats (more MCV items / histogram buckets) to get
+ * good accuracy. Maybe it'd be appropriate to use samples
+ * proportional to the table (say, 0.5% - 1%) instead of a
+ * fixed size might be more appropriate. Also, this should be
+ * bound to the requested statistics size - e.g. number of MCV
+ * items or histogram buckets should require several sample
+ * rows per item/bucket (so the sample should be k*size).

Maybe this merits more discussion. Right now we have an upper bound on
how much to scan for analyze; if we introduce the idea of scanning a
percentage of the relation, the time to analyze very large relations
could increase significantly. Do we have an idea of what to do for
this? For instance, a rule that would make me comfortable would say to
scan a sample 3x the current size when you have a mvstats on 3 columns;
then the size of fraction to scan is still bounded. But does that
actually work? From the wording of this comment, I assume you don't
actually know.

In this block (CreateStatistics)
+ /* look for duplicities */
+ for (i = 0; i < numcols; i++)
+ for (j = 0; j < numcols; j++)
+ if ((i != j) && (attnums[i] == attnums[j]))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_COLUMN),
+ errmsg("duplicate column name in statistics definition")));

isn't it easier to have the inner loop go from i+1 to numcols?

I wonder if this is sensible with multi-relation statistics:
+ /*
+ * Store a dependency too, so that statistics are dropped on DROP TABLE
+ */
+ parentobject.classId = RelationRelationId;
+ parentobject.objectId = ObjectIdGetDatum(RelationGetRelid(rel));
+ parentobject.objectSubId = 0;
+ childobject.classId = MvStatisticRelationId;
+ childobject.objectId = statoid;
+ childobject.objectSubId = 0;

I suppose the idea is to drop the stats if any of the rels they are for
is dropped.

Right after that you create a dependency on the schema. Is that
necessary? Since you have the dependency on the relation, the stats
would be dropped by recursion.

Why are you #include'ing builtins.h everywhere?

RelationGetMVStatList() needs a comment.

Please get rid of common.h. It's totally unlike the way we structure
our header files. We don't keep headers in src/backend; they're all in
src/include. One reason is that the latter gets installed as a whole in
include/server, which this file will not be. This file may be necessary
to build some extensions in the future, for example.

In mvstats.h, please mark function prototypes as "extern".

Many files need a pgindent pass.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Re: multivariate statistics v14 at 2016-03-20 23:34:51 from Tomas Vondra

Responses

Re: multivariate statistics v14 at 2016-03-21 09:34:07 from Robert Haas
Re: multivariate statistics v14 at 2016-03-21 10:30:48 from Tomas Vondra

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Amit Kapila	2016-03-21 04:12:38	Re: Performance degradation in commit ac1d794
Previous Message	Amit Kapila	2016-03-21 03:29:49	Re: Patch: fix lock contention for HASHHDR.mutex