Re: multivariate statistics (v19)

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, David Steele <david(at)pgmasters(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Álvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multivariate statistics (v19)
Date: 2016-09-30 11:10:09
Message-ID: 1c7e4e63-769b-f8ce-f245-85ef4f59fcba@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

This patch set is in pretty good shape, the only problem is that it's so
big that no-one seems to have the time or courage to do the final
touches and commit it. If we just focus on the functional dependencies
part for now, I think we might get somewhere. I peeked at the MCV and
histogram patches too, and I think they make total sense as well, and
are a natural extension of the functional dependencies patch. So if we
just focus on that for now, I don't think we will paint ourselves in the
corner.

(more below)

On 09/14/2016 01:01 AM, Tomas Vondra wrote:
> On 09/12/2016 04:08 PM, Dean Rasheed wrote:
>> Regarding the statistics themselves, I read the description of soft
>> functional dependencies, and I'm somewhat skeptical about that
>> algorithm. I don't like the arbitrary thresholds or the sudden jump
>> from independence to dependence and clause reduction. As others have
>> said, I think this should account for a continuous spectrum of
>> dependence from fully independent to fully dependent, and combine
>> clause selectivities in a way based on the degree of dependence. For
>> example, if you computed an estimate for the fraction 'f' of the
>> table's rows for which a -> b, then it might be reasonable to combine
>> the selectivities using
>>
>> P(a,b) = P(a) * (f + (1-f) * P(b))
>>
>
> Yeah, I agree that the thresholds resulting in sudden changes between
> "dependent" and "not dependent" are annoying. The question is whether it
> makes sense to fix that, though - the functional dependencies were meant
> as the simplest form of statistics, allowing us to get the rest of the
> infrastructure in.
>
> I'm OK with replacing the true/false dependencies with a degree of
> dependency between 0 and 1, but I'm a bit afraid it'll result in
> complaints that the first patch got too large / complicated.

+1 for using a floating degree between 0 and 1, rather than a boolean.

> It also contradicts the idea of using functional dependencies as a
> low-overhead type of statistics, filtering the list of clauses that need
> to be estimated using more expensive types of statistics (MCV lists,
> histograms, ...). Switching to a degree of dependency would prevent
> removal of "unnecessary" clauses.

That sounds OK to me, although I'm not deeply familiar with this patch yet.

>> Of course, having just a single number that tells you the columns are
>> correlated, tells you nothing about whether the clauses on those
>> columns are consistent with that correlation. For example, in the
>> following table
>>
>> CREATE TABLE t(a int, b int);
>> INSERT INTO t SELECT x/10, ((x/10)*789)%100 FROM generate_series(0,999) g(x);
>>
>> 'b' is functionally dependent on 'a' (and vice versa), but if you
>> query the rows with a<50 and with b<50, those clauses behave
>> essentially independently, because they're not consistent with the
>> functional dependence between 'a' and 'b', so the best way to combine
>> their selectivities is just to multiply them, as we currently do.
>>
>> So whilst it may be interesting to determine that 'b' is functionally
>> dependent on 'a', it's not obvious whether that fact by itself should
>> be used in the selectivity estimates. Perhaps it should, on the
>> grounds that it's best to attempt to use all the available
>> information, but only if there are no more detailed statistics
>> available. In any case, knowing that there is a correlation can be
>> used as an indicator that it may be worthwhile to build more detailed
>> multivariate statistics like a MCV list or a histogram on those
>> columns.
>
> Right. IIRC this is actually described in the README as "incompatible
> conditions". While implementing it, I concluded that this is OK and it's
> up to the developer to decide whether the queries are compatible with
> the "assumption of compatibility". But maybe this is reasoning is bogus
> and makes (the current implementation of) functional dependencies
> unusable in practice.

I think that's OK. It seems like a good assumption that the conditions
are "compatible" with the functional dependency. For two reasons:

1) A query with compatible clauses is much more likely to occur in real
life. Why would you run a query with an incompatible ZIP and city clauses?

2) If the conditions were in fact incompatible, the query is likely to
return 0 rows, and will bail out very quickly, even if the estimates are
way off and you choose a non-optimal plan. There are exceptions, of
course: an index scan might be able to conclude that there are no rows
much quicker than a seqscan, but as a general rule of thumb, a query
that returns 0 rows isn't very sensitive to the chosen plan.

And of course, as long as we're not collecting these statistics
automatically, if it doesn't work for your application, just don't
collect them.

I fear that using "statistics" as the name of the new object might get a
bit awkward. "statistics" is a plural, but we use it as the name of a
single object, like "pants" or "scissors". Not sure I have any better
ideas though. "estimator"? "statistics collection"? Or perhaps it should
be singular, "statistic". I note that you actually called the system
table "pg_mv_statistic", in singular.

I'm not a big fan of storing the stats as just a bytea blob, and having
to use special functions to interpret it. By looking at the patch, it's
not clear to me what we actually store for functional dependencies. A
list of attribute numbers? Could we store them simply as an int[]? (I'm
not a big fan of the hack in pg_statistic, that allows storing arrays of
any data type in the same column, though. But for functional
dependencies, I don't think we need that.)

Overall, this is going to be a great feature!

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Emrul 2016-09-30 11:34:57 Re: Learning to hack Postgres - Keeping track of ctids
Previous Message Masahiko Sawada 2016-09-30 10:54:58 Re: [GENERAL] pg_upgrade from 9.5 to 9.6 fails with "invalid argument"