Re: multivariate statistics (v19)

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, David Steele <david(at)pgmasters(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Álvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multivariate statistics (v19)
Date: 2016-12-12 21:50:05
Message-ID: 72eeb3d5-c406-93b0-8ff8-11b31789f683@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

Hi Amit,

attached is v21 of the patch series, rebased to current master
(resolving the duplicate OID and a few trivial merge conflicts), and
also fixing some of the issues you reported.

On 12/12/2016 12:26 PM, Amit Langote wrote:
>
> Hi Tomas,
>
> On 2016/10/30 4:23, Tomas Vondra wrote:
>> Hi,
>>
>> Attached is v20 of the multivariate statistics patch series, doing mostly
>> the changes outlined in the preceding e-mail from October 11.
>>
>> The patch series currently has these parts:
>>
>> * 0001 : (FIX) teach pull_varno about RestrictInfo
>> * 0002 : (PATCH) shared infrastructure and ndistinct coefficients
>> * 0003 : (PATCH) functional dependencies (only the ANALYZE part)
>> * 0004 : (PATCH) selectivity estimation using functional dependencies
>> * 0005 : (PATCH) multivariate MCV lists
>> * 0006 : (PATCH) multivariate histograms
>> * 0007 : (WIP) selectivity estimation using ndistinct coefficients
>> * 0008 : (WIP) use multiple statistics for estimation
>> * 0009 : (WIP) psql tab completion basics
>
> Unfortunately, this failed to compile because of the duplicate_oids error.
> Partitioning patch consumed same OIDs as used in this patch.
>

Fixed, should compile fine now (even each patch in the series).

> I will try to read the patches in some more detail, but in the meantime,
> here are some comments/nitpicks on the documentation:
>
> No updates to doc/src/sgml/catalogs.sgml?
>

Good point. I've added a section for the pg_mv_statistic catalog.

> + <para>
> + The examples presented in <xref linkend="row-estimation-examples"> used
> + statistics about individual columns to compute selectivity estimates.
> + When estimating conditions on multiple columns, the planner assumes
> + independence and multiplies the selectivities. When the columns are
> + correlated, the independence assumption is violated, and the estimates
> + may be seriously off, resulting in poor plan choices.
> + </para>
>
> The term independence is used in isolation - independence of what?
> Independence of the distributions of values in separate columns? Also,
> the phrase "seriously off" could perhaps be replaced by more rigorous
> terminology; it might be unclear to some readers. Perhaps: wildly
> inaccurate, :)
>

I've reworded this to "independence of the conditions" and "off by
several orders of magnitude". Hope that's better.

> +<programlisting>
> +EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1;
> + QUERY PLAN
> +-------------------------------------------------------------------------------------------------
> + Seq Scan on t (cost=0.00..170.00 rows=100 width=8) (actual
> time=0.031..2.870 rows=100 loops=1)
> + Filter: (a = 1)
> + Rows Removed by Filter: 9900
> + Planning time: 0.092 ms
> + Execution time: 3.103 ms
>
> Is there a reason why examples in "67.2. Multivariate Statistics" (like
> the one above) use EXPLAIN ANALYZE, whereas those in "67.1. Row Estimation
> Examples" (also, other relevant chapters) uses just EXPLAIN.
>

Yes, the reason is that while 67.1 shows how the optimizer estimates row
counts and constructs the plan (so EXPLAIN is sufficient), 67.2
demonstrates how the estimates are inaccurate with respect to the actual
row counts. Thus the EXPLAIN ANALYZE.

> + the final 0.01% estimate. The plan however shows that this results in
> + a significant under-estimate, as the actual number of rows matching the
>
> s/under-estimate/underestimate/g
>
> + <para>
> + For additional details about multivariate statistics, see
> + <filename>src/backend/utils/mvstats/README.statsc</>. There are additional
> + <literal>README</> for each type of statistics, mentioned in the following
> + sections.
> + </para>
>
> Referring to source tree READMEs seems novel around this portion of the
> documentation, but I think not too far away, there are some references.
> This is under the VII. Internals chapter anyway, so that might be OK.
>

I think the there's a threshold when the detail becomes too detailed for
the sgml docs - say, when it discusses some implementation details, at
which point a README is more appropriate. I don't know if I got it
entirely right with the docs, though, so perhaps some bits may move in
either direction.

> In any case, s/README.statsc/README.stats/g
>
> Also, s/additional README/additional READMEs/g (tags omitted for brevity)
>
> + used in definitions of database normal forms. When simplified, saying
> that
> + <literal>b</> is functionally dependent on <literal>a</> means that
>

Fixed.

> Maybe, s/When simplified/In simple terms/g
>
> + In normalized databases, only functional dependencies on primary keys
> + and super keys are allowed. In practice however many data sets are not
> + fully normalized, for example thanks to intentional denormalization for
> + performance reasons. The table <literal>t</> is an example of a data
> + with functional dependencies. As <literal>a=b</> for all rows in the
> + table, <literal>a</> is functionally dependent on <literal>b</> and
> + <literal>b</> is functionally dependent on <literal>a</literal>.
>
> "super keys" sounds like a new term.
>

Actually no, "super key" is a term defined in normal forms.

> s/for example thanks to/for example, thanks to/g (or due to instead of
> thanks to)
>
> How about: s/an example of a data with/an example of a schema with/g
>

I think "example of data set" is better. Reworded.

> Perhaps, s/a=b/a = b/g (additional white space)
>
> + Similarly to per-column statistics, multivariate statistics are stored in
>
> I notice that "similar to" is used more often than "similarly to". But
> that might be OK.
>

Not sure.

> + This shows that the statistics is defined on table <structname>t</>,
>
> Perhaps: the statistics is -> the statistics are or the statistic is
>

As that paragraph is only about functional dependencies, I think
'statistic is' is more appropriate.

> + lists <structfield>attnums</structfield> of the columns (references
> + <structname>pg_attribute</structname>).
>
> While this text may be OK on the catalog description page, it might be
> better to expand attnums here as "attribute numbers" dropping the
> parenthesized phrase altogether.
>

Not sure. I've reworded it like this:

This shows that the statistic is defined on table <structname>t</>,
<structfield>attnums</structfield> lists attribute numbers of columns
(references <structname>pg_attribute</structname>). It also shows

Does that sound better?

> +<programlisting>
> +SELECT pg_mv_stats_dependencies_show(stadeps)
> + FROM pg_mv_statistic WHERE staname = 's1';
> +
> + pg_mv_stats_dependencies_show
> +-------------------------------
> + (1) => 2, (2) => 1
> +(1 row)
> +</programlisting>
>
> Couldn't this somehow show actual column names, instead of attribute numbers?
>

Yeah, I was thinking about that too. The trouble is that's table-level
metadata, so we don't have that kind of info serialized within the data
type (e.g. because it would not handle column renames etc.).

It might be possible to explicitly pass the table OID as a parameter of
the function, but it seemed a bit ugly to me.

FWIW, as I wrote in this thread, the place where this patch series needs
feedback most desperately is integration into the optimizer. Currently
all the magic happens in clausesel.c and does not leave it.I think it
would be good to move some of that (particularly the choice of
statistics to apply) to an earlier stage, and store the information
within the plan tree itself, so that it's available outside clausesel.c
(e.g. for EXPLAIN - showing which stats were picked seems useful).

I was thinking it might work similarly to the foreign key estimation
patch (100340e2). It might even be more efficient, as the current code
may end repeating the selection of statistics multiple times. But
enriching the plan tree turned out to be way more invasive than I'm
comfortable with (but maybe that'd be OK).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
multivariate-stats-v21.tgz application/x-compressed-tar 142.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Petr Jelinek 2016-12-12 22:27:30 Re: snapbuild woes
Previous Message Kevin Grittner 2016-12-12 21:47:44 Re: [OSSTEST PATCH 0/1] PostgreSQL db: Retry on constraint violation