Re: multivariate statistics (v19)

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, David Steele <david(at)pgmasters(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Álvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multivariate statistics (v19)
Date: 2016-12-12 11:26:33
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi Tomas,

On 2016/10/30 4:23, Tomas Vondra wrote:
> Hi,
> Attached is v20 of the multivariate statistics patch series, doing mostly
> the changes outlined in the preceding e-mail from October 11.
> The patch series currently has these parts:
> * 0001 : (FIX) teach pull_varno about RestrictInfo
> * 0002 : (PATCH) shared infrastructure and ndistinct coefficients
> * 0003 : (PATCH) functional dependencies (only the ANALYZE part)
> * 0004 : (PATCH) selectivity estimation using functional dependencies
> * 0005 : (PATCH) multivariate MCV lists
> * 0006 : (PATCH) multivariate histograms
> * 0007 : (WIP) selectivity estimation using ndistinct coefficients
> * 0008 : (WIP) use multiple statistics for estimation
> * 0009 : (WIP) psql tab completion basics

Unfortunately, this failed to compile because of the duplicate_oids error.
Partitioning patch consumed same OIDs as used in this patch.

I will try to read the patches in some more detail, but in the meantime,
here are some comments/nitpicks on the documentation:

No updates to doc/src/sgml/catalogs.sgml?

+ <para>
+ The examples presented in <xref linkend="row-estimation-examples"> used
+ statistics about individual columns to compute selectivity estimates.
+ When estimating conditions on multiple columns, the planner assumes
+ independence and multiplies the selectivities. When the columns are
+ correlated, the independence assumption is violated, and the estimates
+ may be seriously off, resulting in poor plan choices.
+ </para>

The term independence is used in isolation - independence of what?
Independence of the distributions of values in separate columns? Also,
the phrase "seriously off" could perhaps be replaced by more rigorous
terminology; it might be unclear to some readers. Perhaps: wildly
inaccurate, :)

+ Seq Scan on t (cost=0.00..170.00 rows=100 width=8) (actual
time=0.031..2.870 rows=100 loops=1)
+ Filter: (a = 1)
+ Rows Removed by Filter: 9900
+ Planning time: 0.092 ms
+ Execution time: 3.103 ms

Is there a reason why examples in "67.2. Multivariate Statistics" (like
the one above) use EXPLAIN ANALYZE, whereas those in "67.1. Row Estimation
Examples" (also, other relevant chapters) uses just EXPLAIN.

+ the final 0.01% estimate. The plan however shows that this results in
+ a significant under-estimate, as the actual number of rows matching the


+ <para>
+ For additional details about multivariate statistics, see
+ <filename>src/backend/utils/mvstats/README.statsc</>. There are additional
+ <literal>README</> for each type of statistics, mentioned in the following
+ sections.
+ </para>

Referring to source tree READMEs seems novel around this portion of the
documentation, but I think not too far away, there are some references.
This is under the VII. Internals chapter anyway, so that might be OK.

In any case, s/README.statsc/README.stats/g

Also, s/additional README/additional READMEs/g (tags omitted for brevity)

+ used in definitions of database normal forms. When simplified, saying
+ <literal>b</> is functionally dependent on <literal>a</> means that

Maybe, s/When simplified/In simple terms/g

+ In normalized databases, only functional dependencies on primary keys
+ and super keys are allowed. In practice however many data sets are not
+ fully normalized, for example thanks to intentional denormalization for
+ performance reasons. The table <literal>t</> is an example of a data
+ with functional dependencies. As <literal>a=b</> for all rows in the
+ table, <literal>a</> is functionally dependent on <literal>b</> and
+ <literal>b</> is functionally dependent on <literal>a</literal>.

"super keys" sounds like a new term.

s/for example thanks to/for example, thanks to/g (or due to instead of
thanks to)

How about: s/an example of a data with/an example of a schema with/g

Perhaps, s/a=b/a = b/g (additional white space)

+ Similarly to per-column statistics, multivariate statistics are stored in

I notice that "similar to" is used more often than "similarly to". But
that might be OK.

+ This shows that the statistics is defined on table <structname>t</>,

Perhaps: the statistics is -> the statistics are or the statistic is

+ lists <structfield>attnums</structfield> of the columns (references
+ <structname>pg_attribute</structname>).

While this text may be OK on the catalog description page, it might be
better to expand attnums here as "attribute numbers" dropping the
parenthesized phrase altogether.

+SELECT pg_mv_stats_dependencies_show(stadeps)
+ FROM pg_mv_statistic WHERE staname = 's1';
+ pg_mv_stats_dependencies_show
+ (1) => 2, (2) => 1
+(1 row)

Couldn't this somehow show actual column names, instead of attribute numbers?

Will read more later.


In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Rafa de la Torre 2016-12-12 11:35:08 Re: Fix for segfault in plpython's exception handling
Previous Message Heikki Linnakangas 2016-12-12 10:52:21 Re: Password identifiers, protocol aging and SCRAM protocol