Re: multivariate statistics (v19)

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, David Steele <david(at)pgmasters(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Álvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multivariate statistics (v19)
Date: 2016-09-12 14:08:01
Message-ID: CAEZATCUtGR+U5+QTwjHhe9rLG2nguEysHQ5NaqcK=VbJ78VQFA@mail.gmail.com
Lists: pgsql-hackers

On 3 August 2016 at 02:58, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> Attached is v19 of the "multivariate stats" patch series

Hi,

I started looking at this, just at a very high level so far; I've not
read much of the detail yet, but here are some initial review comments.

I think the overall infrastructure approach for CREATE STATISTICS
makes sense, and I agree with other suggestions upthread that it would
be useful to be able to build statistics on arbitrary expressions,
although that doesn't need to be part of this patch; it's useful to
keep that in mind as a possible future extension of this initial
design.

I can imagine it being useful to be able to create user-defined
statistics on an arbitrary list of expressions, and I think that would
include univariate as well as multivariate statistics. Perhaps that's
something to take into account in the naming of things, e.g., as David
Rowley suggested, something like pg_statistic_ext, rather than
pg_mv_statistic.

I also like the idea that this might one day be extended to support
statistics across multiple tables, although I think that might be
challenging to achieve -- you'd need a method of taking a random
sample of rows from a join between 2 or more tables. However, if the
intention is to be able to support that one day, I think that needs to
be accounted for in the syntax now -- specifically, I think it will be
too limiting to only support things extending the current syntax of
the form table1(col1, col2, ...), table2(col1, col2, ...), because
that precludes building statistics on an expression referring to
columns from more than one table. So I think we should plan further
ahead and use a syntax giving greater flexibility in the future, for
example something structured more like a query (like CREATE VIEW):

CREATE STATISTICS name
[ WITH (options) ]
ON expression [, ...]
FROM table [, ...]
WHERE condition

where the first version of the patch would only support expressions
that are simple column references, and would require at least 2 such
columns from a single table with no WHERE clause, i.e.:

CREATE STATISTICS name
[ WITH (options) ]
ON column1, column2 [, ...]
FROM table

For multi-table statistics, a WHERE clause would typically be needed
to specify how the tables are expected to be joined, but potentially
such a clause might also be useful in single-table statistics, to
build partial statistics on a commonly queried subset of the table,
just like a partial index.
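
For example, under that proposed syntax (purely illustrative -- none of
this is implemented, and the table, column and statistics names are
invented for the sake of the example), one might eventually write

CREATE STATISTICS order_cust_stats
  ON orders.customer_id, customers.region
  FROM orders, customers
  WHERE orders.customer_id = customers.cust_id;

for a multi-table statistic, or

CREATE STATISTICS active_user_stats
  ON country, account_type
  FROM users
  WHERE active;

for partial statistics on a commonly queried subset of a single table.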

Of course, I'm not suggesting that the current patch do any of that --
it's big enough as it is. I'm just throwing out possible future
directions this might go in, so that we don't get painted into a
corner when designing the syntax for the current patch.

Regarding the statistics themselves, I read the description of soft
functional dependencies, and I'm somewhat skeptical about that
algorithm. I don't like the arbitrary thresholds or the sudden jump
from independence to dependence and clause reduction. As others have
said, I think this should account for a continuous spectrum of
dependence from fully independent to fully dependent, and combine
clause selectivities in a way based on the degree of dependence. For
example, if you computed an estimate for the fraction 'f' of the
table's rows for which a -> b, then it might be reasonable to combine
the selectivities using

P(a,b) = P(a) * (f + (1-f) * P(b))
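
To put some (purely illustrative) numbers on that: with P(a) = 0.1,
P(b) = 0.05 and an estimated degree of dependence f = 0.8, that gives

P(a,b) = 0.1 * (0.8 + 0.2 * 0.05) = 0.081

which falls between the fully independent estimate 0.1 * 0.05 = 0.005
and the fully dependent estimate 0.1, as you'd want.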

Of course, having just a single number that tells you the columns are
correlated tells you nothing about whether the clauses on those
columns are consistent with that correlation. For example, in the
following table

CREATE TABLE t(a int, b int);
INSERT INTO t SELECT x/10, ((x/10)*789)%100 FROM generate_series(0,999) g(x);

'b' is functionally dependent on 'a' (and vice versa), but if you
query the rows with a<50 and with b<50, those clauses behave
essentially independently, because they're not consistent with the
functional dependence between 'a' and 'b', so the best way to combine
their selectivities is just to multiply them, as we currently do.
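
That's easy to check against the table above with something like the
following query -- by my count a<50 and b<50 each match 500 rows, and
the conjunction matches 260, very close to the 250 that multiplying
the individual selectivities predicts:

SELECT count(*) FILTER (WHERE a < 50) AS n_a,
       count(*) FILTER (WHERE b < 50) AS n_b,
       count(*) FILTER (WHERE a < 50 AND b < 50) AS n_ab
FROM t;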

So whilst it may be interesting to determine that 'b' is functionally
dependent on 'a', it's not obvious whether that fact by itself should
be used in the selectivity estimates. Perhaps it should, on the
grounds that it's best to attempt to use all the available
information, but only if there are no more detailed statistics
available. In any case, knowing that there is a correlation can be
used as an indicator that it may be worthwhile to build more detailed
multivariate statistics like a MCV list or a histogram on those
columns.

Looking at the ndistinct coefficient 'q', I think it would be better
if the recorded statistic were just the estimate for
ndistinct(a,b,...) rather than a ratio of ndistinct values. That's a
more fundamental statistic, and it's easier to document and easier to
interpret. Also, I don't believe that the coefficient 'q' is the right
number to use for clause estimation:

Looking at README.ndistinct, I'm skeptical about the selectivity
estimation argument. In the case where a -> b, you'd have q =
ndistinct(b), so then P(a=1 & b=2) would become 1/ndistinct(a), which
is fine for a uniform distribution. But typically, there would be
univariate statistics on a and b, so if for example a=1 were 100x more
likely than average, you'd probably know that and the existing code
computing P(a=1) would reflect that, whereas simply using P(a=1 & b=2)
= 1/ndistinct(a) would be a significant underestimate, since it would
be ignoring known information about the distribution of a.
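
To attach some illustrative numbers to that: if ndistinct(a) = 10000,
the average frequency of a value of 'a' is 1/10000 = 0.0001, so if a=1
actually accounts for 1% of the rows, the univariate stats would give
P(a=1) = 0.01, while 1/ndistinct(a) = 0.0001 -- a 100x underestimate
of P(a=1 & b=2).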

But likewise if, as is later argued, you were to use 'q' as a
correction factor applied to the individual clause selectivities, you
could end up with significant overestimates: if you said P(a=1 & b=2)
= q * P(a=1) * P(b=2), and a=1 were 100x more likely than average, and
a -> b, then b=2 would also be 100x more likely than average (assuming
that b=2 was the value implied by the functional dependency), and that
would also be reflected in the univariate statistics on b, so then you'd
end up with an overall selectivity of around 10000/ndistinct(a), which
would be 100x too big. In fact, since a -> b means that q =
ndistinct(b), there's a good chance of hitting data for which q * P(b)
is greater than 1, so this formula would lead to a combined
selectivity greater than P(a), which is obviously nonsense.
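
With illustrative numbers again: say ndistinct(a) = 100000 and
ndistinct(b) = 1000, so q = ndistinct(b) = 1000. If a=1 is 100x more
likely than average then P(a=1) = 100/100000 = 0.001, and the implied
b=2 being 100x more likely than average gives P(b=2) = 100/1000 = 0.1.
Then q * P(a=1) * P(b=2) = 1000 * 0.001 * 0.1 = 0.1, whereas the true
selectivity is just P(a=1) = 0.001, i.e. the estimate is 100x too big.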

Having a better estimate for ndistinct(a,b,...) looks very useful by
itself for GROUP BY estimation, and there may be other places that
would benefit from it, but I don't think it's the best statistic for
determining functional dependence or combining clause selectivities.
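
For example, with the table t above, a query such as

SELECT a, b FROM t GROUP BY a, b;

has only 100 distinct groups, but (if I'm reading the current group
estimation code correctly) the planner estimates the number of groups
by multiplying the per-column ndistinct values -- 100 * 100 = 10000,
clamped to the table's row count -- which is an order of magnitude too
high. A multivariate ndistinct(a,b) estimate would get that right.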

That's as much as I've looked at so far. It's such a big patch that
it's difficult to consider all at once. I think perhaps the smallest
committable self-contained unit providing a tangible benefit would be
something containing the core infrastructure plus the ndistinct
estimate and the improved GROUP BY estimation.

Regards,
Dean
