Re: multivariate statistics (v19)

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, ideriha(dot)takeshi(at)jp(dot)fujitsu(dot)com, dilipbalaut(at)gmail(dot)com, Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp, dean(dot)a(dot)rasheed(at)gmail(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, ishii(at)postgresql(dot)org, david(at)pgmasters(dot)net, michael(dot)paquier(at)gmail(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, petr(at)2ndquadrant(dot)com, jeff(dot)janes(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: multivariate statistics (v19)
Date: 2017-02-06 22:11:57
Message-ID: 20170206221157.54lzliw3wjhskb6w@alvherre.pgsql
Lists: pgsql-hackers

Looking at 0003, I notice that gram.y is changed to add a WITH ( .. )
clause. If it's not specified, an error is raised. If you create
stats with (ndistinct) then you can't alter it later to add
"dependencies" or whatever; unless I misunderstand, you have to drop the
statistics and create another one. Probably in a forthcoming patch we
should have ALTER support to add a stats type.

Also, why isn't the default to build everything, rather than nothing?

BTW, almost everything in the backend could be inside "utils/", so let's
not put this code there -- let's just create src/backend/statistics/ for
all your code.

Here are a few notes from reading README.dependencies: some typos and two
questions.

diff --git a/src/backend/utils/mvstats/README.dependencies b/src/backend/utils/mvstats/README.dependencies
index 908f094..7f3ed3d 100644
--- a/src/backend/utils/mvstats/README.dependencies
+++ b/src/backend/utils/mvstats/README.dependencies
@@ -36,7 +36,7 @@ design choice to model the dataset in denormalized way, either because of
performance or to make querying easier.


-soft dependencies
+Soft dependencies
-----------------

Real-world data sets often contain data errors, either because of data entry
@@ -48,7 +48,7 @@ rendering the approach mostly useless even for slightly noisy data sets, or
result in sudden changes in behavior depending on minor differences between
samples provided to ANALYZE.

-For this reason the statistics implementes "soft" functional dependencies,
+For this reason the statistics implements "soft" functional dependencies,
associating each functional dependency with a degree of validity (a number
number between 0 and 1). This degree is then used to combine selectivities
in a smooth manner.
@@ -75,6 +75,7 @@ The algorithm also requires a minimum size of the group to consider it
consistent (currently 3 rows in the sample). Small groups make it less likely
to break the consistency.

+## What is it that we store in the catalog?

Clause reduction (planner/optimizer)
------------------------------------
@@ -95,12 +96,12 @@ example for (a,b,c) we first use (a,b=>c) to break the computation into
and then apply (a=>b) the same way on P(a=?,b=?).


-Consistecy of clauses
+Consistency of clauses
---------------------

Functional dependencies only express general dependencies between columns,
without referencing particular values. This assumes that the equality clauses
-are in fact consistent with the functinal dependency, i.e. that given a
+are in fact consistent with the functional dependency, i.e. that given a
dependency (a=>b), the value in (b=?) clause is the value determined by (a=?).
If that's not the case, the clauses are "inconsistent" with the functional
dependency and the result will be over-estimation.
@@ -111,6 +112,7 @@ set will be empty, but we'll estimate the selectivity using the ZIP condition.

In this case the default estimation based on AVIA principle happens to work
better, but mostly by chance.
+## what is AVIA principle?

This issue is the price for the simplicity of functional dependencies. If the
application frequently constructs queries with clauses inconsistent with
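
To illustrate the "combine selectivities in a smooth manner" wording in the
README hunk above, here is a minimal sketch of one way a degree of validity
could blend the two extremes. The function name, the parameter names and the
exact formula are assumptions made for illustration, not taken from the patch:
with degree 0 the result is the plain independence estimate, and with degree 1
the clause on b is treated as fully implied by the clause on a.

```python
def combine_selectivities(sel_a: float, sel_b: float, degree: float) -> float:
    """Estimate P(a=?, b=?) under a soft dependency (a => b).

    Hypothetical helper: `degree` is the validity of the dependency,
    a number between 0 and 1.
    """
    if not 0.0 <= degree <= 1.0:
        raise ValueError("degree must be between 0 and 1")
    # degree == 0: sel_a * sel_b (columns treated as independent)
    # degree == 1: sel_a         (b is fully determined by a)
    return sel_a * (degree + (1.0 - degree) * sel_b)


if __name__ == "__main__":
    # Same per-column selectivities, increasing degree of validity.
    for degree in (0.0, 0.5, 1.0):
        print(degree, combine_selectivities(0.1, 0.2, degree))
```

Because the estimate moves linearly with the degree, a small change in the
degree computed from one ANALYZE sample to the next produces only a small
change in the estimate, which is the behavior the README paragraph asks for.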

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
