Re: multivariate statistics (v19)

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: ideriha(dot)takeshi(at)jp(dot)fujitsu(dot)com
Cc: tomas(dot)vondra(at)2ndquadrant(dot)com, dilipbalaut(at)gmail(dot)com, Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp, dean(dot)a(dot)rasheed(at)gmail(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, ishii(at)postgresql(dot)org, david(at)pgmasters(dot)net, michael(dot)paquier(at)gmail(dot)com, alvherre(at)2ndquadrant(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, petr(at)2ndquadrant(dot)com, jeff(dot)janes(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: multivariate statistics (v19)
Date: 2017-01-26 11:01:07
Message-ID: 20170126.200107.11536715.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello, I'll return on this since this should welcome more eyeballs.

At Thu, 26 Jan 2017 09:03:10 +0000, "Ideriha, Takeshi" <ideriha(dot)takeshi(at)jp(dot)fujitsu(dot)com> wrote in <4E72940DA2BF16479384A86D54D0988A565822A9(at)G01JPEXMBKW04>
> Hi
>
> When you have time, could you rebase the pathes?
> Some patches cannot be applied to the current HEAD.

For those who are willing to look this,
352a24a1f9d6f7d4abb1175bfd22acc358f43140 breaks this. So just
before it can accept this patches cleanly.

> 0001 patch can be applied but the following 0002 patch cannot be.
>
> I've just started reading your patch (mainly docs and README, not yet source code.)
>
> Though these are minor things, I've found some typos or mistakes in the document and README.
>
> >+ statistics on the table. The statistics will be created in the in the
> >+ current database. The statistics will be owned by the user issuing
>
> Regarding line 629 at 0002-PATCH-shared-infrastructure-and-ndistinct-coeffi-v22.patch,
> there is a double "in the".
>
> >+ knowledge of a value in the first column is sufficient for detemining the
> >+ value in the other column. Then functional dependencies are built on those
>
> Regarding line 701 at 0002-PATCH,
> "determining" is mistakenly spelled "detemining".
>
>
> >@@ -0,0 +1,98 @@
> >+Multivariate statististics
> >+==========================
>
> Regarding line 2415 at 0002-PATCH, "statististics" should be statistics
>
>
> >+ <refnamediv>
> >+ <refname>CREATE STATISTICS</refname>
> >+ <refpurpose>define a new statistics</refpurpose>
> >+ </refnamediv>
>
> >+ <refnamediv>
> >+ <refname>DROP STATISTICS</refname>
> >+ <refpurpose>remove a statistics</refpurpose>
> >+ </refnamediv>
>
> Regarding line 612 and 771 at 0002-PATCH,
> I assume saying "multiple statistics" explicitly is easier to understand to users
> since these commands don't for the statistics we already have in the pg_statistics in my understanding.
>
> >+ [1] http://en.wikipedia.org/wiki/Database_normalization
>
> Regarding line 386 at 0003-PATCH, is it better to change this link to this one:
> https://en.wikipedia.org/wiki/Functional_dependency ?
> README.dependencies cites directly above link.
>
> Though I pointed out these typoes and so on,
> I believe these feedback are less priority compared to the source code itself.
>
> So please work on my feedback if you have time.

README.dependencies

> dependencies, and for each one count the number of rows rows consistent it.
"of rows rows consistent it" => "or rows consistent with it"?

> are in fact consistent with the functinal dependency, i.e. that given the a

"that given the a" => "that given a" ?

dependencies.c:

dependency_dgree():

- The k is assumed larger than 1. I think assertion is required.

- "/* end of the preceding group */" seems to be better if it
is just after the "if (multi_sort.." currently just after it.

- The following comment seems mis-edited.
> * If there is a single are no contradicting rows, count the group
> * as supporting, otherwise contradicting.

maybe this would be like the following? The varialbe counting
the first "contradiction" is named "n_violations". This seems
somewhat confusing.

> * If there are no violating rows up to here, count the group
> * as supporting, otherwise contradicting.

- "/* first columns match, but the last one does not"
else if (multi_sort_compare_dims((k - 1), (k - 1), ...

The above comparison should use multi_sort_compare_dim, not
dims

- This function counts "n_contradicting_rows" but it is not
referenced. Anyway n_contradicting_rows = numrows -
n_supporing_rows so it and n_contradicting seem
unncecessary.

build_mv_dependencies():

- In the commnet,
"* covering jut 2 columns, to the largest ones, covering all columns"
"* included int the statistics. We start from the smallest ones because we"

l1: "jut" => "just", l2: "int" => "in"

mvstats.h:

- struct MVDependencyData/ MVDependenciesData

The varialbe length member at the last of the structs should
be defined using FLEXIBLE_ARRAY_MEMBER, from the convention.

- I'm not sure how much it impacts performance, but some
struct members seems to have a bit too wide types. For
example, MVDepedenciesData.type is of int32 but it can have
only '1' for now and it won't be two-digits. Also ndeps
cannot be so large.

common.c:

multi_sort_compare_dims needs comment.

general:
This patch uses int16 as the type of attrubute number but it
might be better to use AttrNumber for the purpose.
(Specifically it seems defined as the type for an attribute
index but also used as the varialbe for number of attributes)

Sorry for the random comment in advance. I'll learn this further.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stas Kelvich 2017-01-26 11:34:47 Re: logical decoding of two-phase transactions
Previous Message Emre Hasegeli 2017-01-26 10:53:28 Re: Floating point comparison inconsistencies of the geometric types