Re: multivariate statistics (v25)

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: tomas(dot)vondra(at)2ndquadrant(dot)com
Cc: david(dot)rowley(at)2ndquadrant(dot)com, alvherre(at)2ndquadrant(dot)com, david(at)fetter(dot)org, dean(dot)a(dot)rasheed(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: multivariate statistics (v25)
Date: 2017-04-05 02:53:51
Message-ID: 20170405.115351.110755414.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

At Tue, 4 Apr 2017 20:19:39 +0200, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote in <56f40b20-c464-fad2-ff39-06b668fac47c(at)2ndquadrant(dot)com>
> On 04/04/2017 09:55 AM, David Rowley wrote:
> > On 1 April 2017 at 04:25, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
> > wrote:
> >> I've attached an updated patch.
> >
> > I've made another pass at this and ended up removing the tryextstats
> > variable. We now only try to use extended statistics when
> > clauselist_selectivity() is given a valid RelOptInfo with rtekind ==
> > RTE_RELATION, and of course, it must also have some extended stats
> > defined too.
> >
> > I've also cleaned up a few more comments, many of which I managed to
> > omit updating when I refactored how the selectivity estimates ties
> > into clauselist_selectivity()
> >
> > I'm quite happy with all of this now, and would also be happy for
> > other people to take a look and comment.
> >
> > As a reviewer, I'd be marking this ready for committer, but I've moved
> > a little way from just reviewing this now, having spent two weeks
> > hacking at it.
> >
> > The latest patch is attached.
> >
>
> Thanks David, I agree the reworked patch is much cleaner that the last
> version I posted. Thanks for spending your time on it.
>
> Two minor comments:
>
> 1) DEPENDENCY_MIN_GROUP_SIZE
>
> I'm not sure we still need the min_group_size, when evaluating
> dependencies. It was meant to deal with 'noisy' data, but I think it
> after switching to the 'degree' it might actually be a bad idea.
>
> Consider this:
>
> create table t (a int, b int);
> insert into t select 1, 1 from generate_series(1, 10000) s(i);
> insert into t select i, i from generate_series(2, 20000) s(i);
> create statistics s with (dependencies) on (a,b) from t;
> analyze t;
>
> select stadependencies from pg_statistic_ext ;
> stadependencies
> --------------------------------------------
> [{1 => 2 : 0.333344}, {2 => 1 : 0.333344}]
> (1 row)
>
> So the degree of the dependency is just ~0.333 although it's obviously
> a perfect dependency, i.e. a knowledge of 'a' determines 'b'. The
> reason is that we discard 2/3 of rows, because those groups are only a
> single row each, except for the one large group (1/3 of rows).
>
> Without the mininum group size limitation, the dependencies are:
>
> test=# select stadependencies from pg_statistic_ext ;
> stadependencies
> --------------------------------------------
> [{1 => 2 : 1.000000}, {2 => 1 : 1.000000}]
> (1 row)
>
> which seems way more reasonable, I think.

I think the same. Quite large part of functional dependency in
reality is in this kind.

> 2) A minor detail is that instead of this
>
> if (estimatedclauses != NULL &&
> bms_is_member(listidx, estimatedclauses))
> continue;
>
> perhaps we should do just this:
>
> if (bms_is_member(listidx, estimatedclauses))
> continue;
>
> bms_is_member does the same NULL check right at the beginning, so I
> don't think this might make a measurable difference.

I have some other comments.

======
- The comment for clauselist_selectivity,
| + * When 'rel' is not null and rtekind = RTE_RELATION, we'll try to apply
| + * selectivity estimates using any extended statistcs on 'rel'.

The 'rel' is actually a parameter but rtekind means rel->rtekind
so this might be better be such like the following.

| When a relation of RTE_RELATION is given as 'rel', we try
| extended statistcs on the relation.

Then the following line doesn't seem to be required.

| + * If we identify such extended statistics exist, we try to apply them.

=====
The following comment in the same function,

| + if (rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
| + {
| + /*
| + * Try to estimate with multivariate functional dependency statistics.
| + *
| + * The function will supply an estimate for the clauses which it
| + * estimated for. Any clauses which were unsuitible were ignored.
| + * Clauses which were estimated will have their 0-based list index set
| + * in estimatedclauses. We must ignore these clauses when processing
| + * the remaining clauses later.
| + */

(Notice that I'm not a good writer) This might better be the
following.

| dependencies_clauselist_selectivity gives selectivity over
| caluses that functional dependencies on the given relation is
| applicable. 0-based index numbers of consumed clauses are
| returned in the bitmap set estimatedclauses so that the
| estimation here after can ignore them.

=====
| + s1 *= dependencies_clauselist_selectivity(root, clauses, varRelid,
| + jointype, sjinfo, rel, &estimatedclauses);

The name prefix "dependency_" means "functional_dependency" here
and omitting "functional" is confusing to me. On the other hand
"functional_dependency" is quite long as prefix. Could we use
"func_dependency" or something that is shorter but meaningful?
(But this change causes renaming of many other sutff..)

=====
The name "dependency_compatible_clause" might be meaningful if it
were "clause_is_compatible_with_(functional_)dependency" or such.

=====
dependency_compatible_walker() returns true if given node is
*not* compatible. Isn't it confusing?

=====
dependency_compatible_walker() seems implicitly expecting that
RestrictInfo will be given at the first. RestrictInfo might
should be processed outside this function in _compatible_clause().

=====
dependency_compatible_walker() can return two or more attriburtes
but dependency_compatible_clause() errors out in the case. Since
_walker is called only from the _clause, _walker can return
earlier with "incompatible" in such a case.

=====
In the comment in dependencies_clauselist_selectivity(),

| /*
| * Technically we could find more than one clause for a given
| * attnum. Since these clauses must be equality clauses, we choose
| * to only take the selectivity estimate from the final clause in
| * the list for this attnum. If the attnum happens to be compared
| * to a different Const in another clause then no rows will match
| * anyway. If it happens to be compared to the same Const, then
| * ignoring the additional clause is just the thing to do.
| */
| if (dependency_implies_attribute(dependency,
| list_attnums[listidx]))

If multiple clauses include the attribute, selectivity estimates
for clauses other than the last one are waste of time. Why not the
first one but the last one?

Even if all clauses should be added into estimatedclauses,
calling clause_selectivity once is enough. Since
clause_selectivity may return 1.0 for some clauses, using s2 for
the decision seems reasonable.

| if (dependency_implies_attribute(dependency,
| list_attnums[listidx]))
| {
| clause = (Node *) lfirst(l);
+ if (s2 == 1.0)
| s2 = clause_selectivity(root, clause, varRelid, jointype, sjinfo,

# This '==' works since it is not a result of a calculation.

=====
Still in dependencies_clauselist_selectivity,
dependency_implies_attributes seems designed to return true for
at least one clause in the clauses but any failure leands to
infinite loop. I think any measure against the case is required.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Vitaly Burovoy 2017-04-05 02:53:56 Re: identity columns
Previous Message Ashutosh Bapat 2017-04-05 02:50:14 Re: Partition-wise join for join between (declaratively) partitioned tables