Re: multivariate statistics v14

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Tatsuo Ishii <ishii(at)postgresql(dot)org>, horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp, jeff(dot)janes(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: multivariate statistics v14
Date: 2016-03-21 10:30:48
Message-ID: 66a87077-0ad5-8a93-296d-f74ad0df5538@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 03/21/2016 04:34 AM, Alvaro Herrera wrote:
> Another skim on 0002:
>
> reference.sgml is missing a call to &alterStatistic.
>
> ObjectProperty[] contains a comment that the ACL is "same as relation",
> but is that still correct, given that now stats may be related to more
> than one relation? Do we even know what the rules for ACLs on
> cross-relation stats are? One very simple way to get around this is to
> dictate that all the rels must have the same owner. Perhaps we're not
> considering the multi-relation case yet?

As I wrote in response to Robert's message, I don't think we need ACLs
for statistics - the user should be able to use them when they can
access all the underlying relations (in a query). For ALTER STATISTICS
the (owner || superuser) check should be enough, right?

>
> We have this FIXME comment in do_analyze_rel:
>
> + * FIXME This sample sizing is mostly OK when computing stats for
> + * individual columns, but when computing multi-variate stats
> + * for multivariate stats (histograms, mcv, ...) it's rather
> + * insufficient. For stats on multiple columns / complex stats
> + * we need larger sample sizes, because we need to build more
> + * detailed stats (more MCV items / histogram buckets) to get
> + * good accuracy. Maybe it'd be appropriate to use samples
> + * proportional to the table (say, 0.5% - 1%) instead of a
> + * fixed size might be more appropriate. Also, this should be
> + * bound to the requested statistics size - e.g. number of MCV
> + * items or histogram buckets should require several sample
> + * rows per item/bucket (so the sample should be k*size).
>
> Maybe this merits more discussion. Right now we have an upper bound on
> how much to scan for analyze; if we introduce the idea of scanning a
> percentage of the relation, the time to analyze very large relations
> could increase significantly. Do we have an idea of what to do for
> this? For instance, a rule that would make me comfortable would say to
> scan a sample 3x the current size when you have a mvstats on 3 columns;
> then the size of fraction to scan is still bounded. But does that
> actually work? From the wording of this comment, I assume you don't
> actually know.

Yeah. I think more discussion is needed, because I myself am not sure
the FIXME is actually correct. For now I think we're OK with using the
same logic as statistics on a single column (300 * target).

>
> In this block (CreateStatistics)
> + /* look for duplicities */
> + for (i = 0; i < numcols; i++)
> + for (j = 0; j < numcols; j++)
> + if ((i != j) && (attnums[i] == attnums[j]))
> + ereport(ERROR,
> + (errcode(ERRCODE_UNDEFINED_COLUMN),
> + errmsg("duplicate column name in statistics definition")));
>
> isn't it easier to have the inner loop go from i+1 to numcols?

It probably is.

>
> I wonder if this is sensible with multi-relation statistics:
> + /*
> + * Store a dependency too, so that statistics are dropped on DROP TABLE
> + */
> + parentobject.classId = RelationRelationId;
> + parentobject.objectId = ObjectIdGetDatum(RelationGetRelid(rel));
> + parentobject.objectSubId = 0;
> + childobject.classId = MvStatisticRelationId;
> + childobject.objectId = statoid;
> + childobject.objectSubId = 0;
>
> I suppose the idea is to drop the stats if any of the rels they are for
> is dropped.

What do you mean by sensible? I mean, we don't support multiple tables
at this point (except for choosing a syntax that should allow that), but
the code assumes a single relation on a few places (like this one).

>
> Right after that you create a dependency on the schema. Is that
> necessary? Since you have the dependency on the relation, the stats
> would be dropped by recursion.

Hmmmm, that's probably right. Also, now that I think about it, it
probably gets broken after ALTER STATISTICS ... SET SCHEMA, because the
code does not remove the old dependency (and does not create a new one).

>
> Why are you #include'ing builtins.h everywhere?

Stupidity.

>
> RelationGetMVStatList() needs a comment.

OK.

>
> Please get rid of common.h. It's totally unlike the way we structure
> our header files. We don't keep headers in src/backend; they're all in
> src/include. One reason is that the latter gets installed as a whole in
> include/server, which this file will not be. This file may be necessary
> to build some extensions in the future, for example.

OK, I'll rework that and move it to src/include/.

>
> In mvstats.h, please mark function prototypes as "extern".
>
> Many files need a pgindent pass.

OK.

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Artur Zakirov 2016-03-21 11:04:35 Re: [PATCH] Phrase search ported to 9.6
Previous Message Andres Freund 2016-03-21 10:10:57 Re: Performance degradation in commit ac1d794