Re: multivariate statistics (v25)

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, David Fetter <david(at)fetter(dot)org>, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multivariate statistics (v25)
Date: 2017-04-05 14:47:44
Message-ID: CAKJS1f_vUck+qbxqsh4m_FNs+D4XA2KPyvvh0zYMjVg-eEHeAQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 5 April 2017 at 14:53, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> At Tue, 4 Apr 2017 20:19:39 +0200, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote in <56f40b20-c464-fad2-ff39-06b668fac47c(at)2ndquadrant(dot)com>
>> Two minor comments:
>>
>> 1) DEPENDENCY_MIN_GROUP_SIZE
>>
>> I'm not sure we still need the min_group_size, when evaluating
>> dependencies. It was meant to deal with 'noisy' data, but I think it
>> after switching to the 'degree' it might actually be a bad idea.

Yeah, I'd wondered about this when I first started testing the patch.
I failed to get any functional dependencies because my values were too
unique. Seems I'd gotten a bit used to it, and in the end thought that
if the values are unique enough then they won't suffer as much from
the underestimation problem you're trying to solve here.

I've removed that part of the code now.

> I think the same. Quite large part of functional dependency in
> reality is in this kind.
>
>> 2) A minor detail is that instead of this
>>
>> if (estimatedclauses != NULL &&
>> bms_is_member(listidx, estimatedclauses))
>> continue;
>>
>> perhaps we should do just this:
>>
>> if (bms_is_member(listidx, estimatedclauses))
>> continue;
>>
>> bms_is_member does the same NULL check right at the beginning, so I
>> don't think this might make a measurable difference.

hmm yeah, I'd added that because I thought the estimatedclauses would
be NULL in 99.9% of cases and thought that I might be able to shave a
few cycles off. I see that there's an x < 0 test before the NULL test
in the function. Anyway, I'm not going to put up a fight here, so I've
removed it. I didn't ever benchmark anything to see if the extra test
actually helped anyway...

> I have some other comments.
>
> ======
> - The comment for clauselist_selectivity,
> | + * When 'rel' is not null and rtekind = RTE_RELATION, we'll try to apply
> | + * selectivity estimates using any extended statistcs on 'rel'.
>
> The 'rel' is actually a parameter but rtekind means rel->rtekind
> so this might be better be such like the following.
>
> | When a relation of RTE_RELATION is given as 'rel', we try
> | extended statistcs on the relation.
>
> Then the following line doesn't seem to be required.
>
> | + * If we identify such extended statistics exist, we try to apply them.

Yes, good point. I've revised this comment a bit now.

>
> =====
> The following comment in the same function,
>
> | + if (rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
> | + {
> | + /*
> | + * Try to estimate with multivariate functional dependency statistics.
> | + *
> | + * The function will supply an estimate for the clauses which it
> | + * estimated for. Any clauses which were unsuitible were ignored.
> | + * Clauses which were estimated will have their 0-based list index set
> | + * in estimatedclauses. We must ignore these clauses when processing
> | + * the remaining clauses later.
> | + */
>
> (Notice that I'm not a good writer) This might better be the
> following.
>
> | dependencies_clauselist_selectivity gives selectivity over
> | caluses that functional dependencies on the given relation is
> | applicable. 0-based index numbers of consumed clauses are
> | returned in the bitmap set estimatedclauses so that the
> | estimation here after can ignore them.

I've changed this one too now.

> =====
> | + s1 *= dependencies_clauselist_selectivity(root, clauses, varRelid,
> | + jointype, sjinfo, rel, &estimatedclauses);
>
> The name prefix "dependency_" means "functional_dependency" here
> and omitting "functional" is confusing to me. On the other hand
> "functional_dependency" is quite long as prefix. Could we use
> "func_dependency" or something that is shorter but meaningful?
> (But this change causes renaming of many other sutff..)

oh no! Many functions in dependencies.c start with dependencies_. To
me, it's a bit of an OOP thing, which if we'd been using some other
language would have been dependencies->clauselist_selectivity(). Of
course, not all functions in that file follow that rule, but I don't
feel a pressing need to go make that any worse. Perhaps the prefix
could be func_dependency, but I really don't feel very excited about
having it that way, and even less so about making the change.

> =====
> The name "dependency_compatible_clause" might be meaningful if it
> were "clause_is_compatible_with_(functional_)dependency" or such.

I could maybe squeeze the word "is" in there. ... OK done.

> =====
> dependency_compatible_walker() returns true if given node is
> *not* compatible. Isn't it confusing?

Yeah.

>
> =====
> dependency_compatible_walker() seems implicitly expecting that
> RestrictInfo will be given at the first. RestrictInfo might(
> should be processed outside this function in _compatible_clause().

Actually, I don't really see a great need for this to be a recursive
walker type function. So I've just gone and stuck all that logic in
dependency_is_compatible_clause() instead.

> =====
> dependency_compatible_walker() can return two or more attriburtes
> but dependency_compatible_clause() errors out in the case. Since
> _walker is called only from the _clause, _walker can return
> earlier with "incompatible" in such a case.

I don't quite see how it's possible for it to ever have more than 1
attnum in there. We only capture Vars from one side of a binary
OpExpr. If one side of the OpExpr is an Expr, then we'd not capture
anything, and not recurse into the Expr. Anyway, I've pulled that code
out into dependency_is_compatible_clause now.

> =====
> In the comment in dependencies_clauselist_selectivity(),
>
> | /*
> | * Technically we could find more than one clause for a given
> | * attnum. Since these clauses must be equality clauses, we choose
> | * to only take the selectivity estimate from the final clause in
> | * the list for this attnum. If the attnum happens to be compared
> | * to a different Const in another clause then no rows will match
> | * anyway. If it happens to be compared to the same Const, then
> | * ignoring the additional clause is just the thing to do.
> | */
> | if (dependency_implies_attribute(dependency,
> | list_attnums[listidx]))
>
> If multiple clauses include the attribute, selectivity estimates
> for clauses other than the last one are waste of time. Why not the
> first one but the last one?

Why not the middle one? Really it's not expected to be a common case.
If someone writes: WHERE a = 1 and a = 2; then they'll likely not get
many results back. If the same clause is duplicated then well, it
won't be the only thing that does a little needless extra work. I
don't think optimising for this is worth the trouble.

>
> Even if all clauses should be added into estimatedclauses,
> calling clause_selectivity once is enough. Since
> clause_selectivity may return 1.0 for some clauses, using s2 for
> the decision seems reasonable.
>
> | if (dependency_implies_attribute(dependency,
> | list_attnums[listidx]))
> | {
> | clause = (Node *) lfirst(l);
> + if (s2 == 1.0)
> | s2 = clause_selectivity(root, clause, varRelid, jointype, sjinfo,
>
> # This '==' works since it is not a result of a calculation.

I don't think this is an important optimisation. It's a corner case if
more than one match, although not impossible. I vote to leave it as
is, and not optimise the corner case.

> =====
> Still in dependencies_clauselist_selectivity,
> dependency_implies_attributes seems designed to return true for
> at least one clause in the clauses but any failure leands to
> infinite loop. I think any measure against the case is required.

I did consider this, but I really can't see a scenario that this is
possible. find_strongest_dependency() would not have found a
dependency if dependency_implies_attribute() was going to fail, so
we'd have exited the loop already. I think it's safe providing that
'clauses_attnums' is in sync with the clauses that we'll examine in
the loop over the 'clauses' list. Perhaps the while loop should have
some safety valve, but I'm not all that sure what that would be, and
since I can't see how it could become an infinite loop, I've not
bothered to think too hard about what else might be done here.

I've attached an updated patch to address Tomas' concerns and yours too.

Thank you to both for looking at my changes

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
mv_functional-deps_2017-04-06.patch application/octet-stream 94.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2017-04-05 14:48:27 Re: partitioned tables and contrib/sepgsql
Previous Message Andres Freund 2017-04-05 14:45:21 Re: Rewriting the test of pg_upgrade as a TAP test