Re: multivariate statistics (v19)

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, David Steele <david(at)pgmasters(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Álvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multivariate statistics (v19)
Date: 2016-09-13 22:01:31
Message-ID: 0ce73e37-7be4-8d9f-1ec4-46b0ca1d90c6@2ndquadrant.com
Lists: pgsql-hackers

Hi,

Thanks for looking into this!

On 09/12/2016 04:08 PM, Dean Rasheed wrote:
> On 3 August 2016 at 02:58, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> Attached is v19 of the "multivariate stats" patch series
>
> Hi,
>
> I started looking at this - just at a very high level - I've not read
> much of the detail yet, but here are some initial review comments.
>
> I think the overall infrastructure approach for CREATE STATISTICS
> makes sense, and I agree with other suggestions upthread that it would
> be useful to be able to build statistics on arbitrary expressions,
> although that doesn't need to be part of this patch, it's useful to
> keep that in mind as a possible future extension of this initial
> design.
>
> I can imagine it being useful to be able to create user-defined
> statistics on an arbitrary list of expressions, and I think that would
> include univariate as well as multivariate statistics. Perhaps that's
> something to take into account in the naming of things, e.g., as David
> Rowley suggested, something like pg_statistic_ext, rather than
> pg_mv_statistic.
>
> I also like the idea that this might one day be extended to support
> statistics across multiple tables, although I think that might be
> challenging to achieve -- you'd need a method of taking a random
> sample of rows from a join between 2 or more tables. However, if the
> intention is to be able to support that one day, I think that needs to
> be accounted for in the syntax now -- specifically, I think it will be
> too limiting to only support things extending the current syntax of
> the form table1(col1, col2, ...), table2(col1, col2, ...), because
> that precludes building statistics on an expression referring to
> columns from more than one table. So I think we should plan further
> ahead and use a syntax giving greater flexibility in the future, for
> example something structured more like a query (like CREATE VIEW):
>
> CREATE STATISTICS name
> [ WITH (options) ]
> ON expression [, ...]
> FROM table [, ...]
> WHERE condition
>
> where the first version of the patch would only support expressions
> that are simple column references, and would require at least 2 such
> columns from a single table with no WHERE clause, i.e.:
>
> CREATE STATISTICS name
> [ WITH (options) ]
> ON column1, column2 [, ...]
> FROM table
>
> For multi-table statistics, a WHERE clause would typically be needed
> to specify how the tables are expected to be joined, but potentially
> such a clause might also be useful in single-table statistics, to
> build partial statistics on a commonly queried subset of the table,
> just like a partial index.

Hmm, the "partial statistics" idea seems interesting, It would allow us
to provide additional / more detailed statistics only for a subset of a
table.

I'm not sure about the join case, however - how would the syntax work
with outer joins? But as you said, we only need

CREATE STATISTICS name
[ WITH (options) ]
ON (column1, column2 [, ...])
FROM table
WHERE condition

until we add support for join statistics.
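
For example (table, column and condition names entirely made up), a
partial statistics definition might look like:

CREATE STATISTICS deliveries_delayed_stats
ON (warehouse_id, carrier_id)
FROM deliveries
WHERE status = 'delayed';

i.e. ANALYZE would presumably only consider rows matching the
condition, pretty much like a partial index.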

>
> Regarding the statistics themselves, I read the description of soft
> functional dependencies, and I'm somewhat skeptical about that
> algorithm. I don't like the arbitrary thresholds or the sudden jump
> from independence to dependence and clause reduction. As others have
> said, I think this should account for a continuous spectrum of
> dependence from fully independent to fully dependent, and combine
> clause selectivities in a way based on the degree of dependence. For
> example, if you computed an estimate for the fraction 'f' of the
> table's rows for which a -> b, then it might be reasonable to combine
> the selectivities using
>
> P(a,b) = P(a) * (f + (1-f) * P(b))
>

Yeah, I agree that the thresholds resulting in sudden changes between
"dependent" and "not dependent" are annoying. The question is whether it
makes sense to fix that, though - the functional dependencies were meant
as the simplest form of statistics, allowing us to get the rest of the
infrastructure in.
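
Just to make sure I'm reading the proposed formula right: with f = 0 it
reduces to plain independence, P(a) * P(b), and with f = 1 it gives
P(a), i.e. the clause on b is treated as fully implied. So for example
with f = 0.5, P(a) = 0.1 and P(b) = 0.2 we'd get

P(a,b) = 0.1 * (0.5 + 0.5 * 0.2) = 0.06

which falls between the independence estimate (0.02) and the fully
dependent one (0.1). That does seem like a reasonable interpolation.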

I'm OK with replacing the true/false dependencies with a degree of
dependency between 0 and 1, but I'm a bit afraid it'll result in
complaints that the first patch got too large / complicated.

It also contradicts the idea of using functional dependencies as a
low-overhead type of statistics, filtering the list of clauses that need
to be estimated using more expensive types of statistics (MCV lists,
histograms, ...). Switching to a degree of dependency would prevent
removal of "unnecessary" clauses.

> Of course, having just a single number that tells you the columns are
> correlated, tells you nothing about whether the clauses on those
> columns are consistent with that correlation. For example, in the
> following table
>
> CREATE TABLE t(a int, b int);
> INSERT INTO t SELECT x/10, ((x/10)*789)%100 FROM generate_series(0,999) g(x);
>
> 'b' is functionally dependent on 'a' (and vice versa), but if you
> query the rows with a<50 and with b<50, those clauses behave
> essentially independently, because they're not consistent with the
> functional dependence between 'a' and 'b', so the best way to combine
> their selectivities is just to multiply them, as we currently do.
>
> So whilst it may be interesting to determine that 'b' is functionally
> dependent on 'a', it's not obvious whether that fact by itself should
> be used in the selectivity estimates. Perhaps it should, on the
> grounds that it's best to attempt to use all the available
> information, but only if there are no more detailed statistics
> available. In any case, knowing that there is a correlation can be
> used as an indicator that it may be worthwhile to build more detailed
> multivariate statistics like a MCV list or a histogram on those
> columns.
>

Right. IIRC this is actually described in the README as "incompatible
conditions". While implementing it, I concluded that this is OK and it's
up to the developer to decide whether the queries are compatible with
the "assumption of compatibility". But maybe this is reasoning is bogus
and makes (the current implementation of) functional dependencies
unusable in practice.
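
FWIW with your example table it's easy to see the problem (untested,
but should be about right):

SELECT count(*) FROM t WHERE a < 50 AND b < 50;

The actual result is close to what the product of the per-clause
selectivities predicts (0.5 * 0.5 * 1000 = 250 rows), while treating
the clause on b as implied by the functional dependency would suggest
~500 rows - exactly the "incompatible conditions" issue.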

But I like the idea of reversing the order from

(a) look for functional dependencies
(b) reduce the clauses using functional dependencies
(c) estimate the rest using multivariate MCV/histograms

to

(a) estimate the clauses using multivariate MCV/histograms
(b) try to apply functional dependencies on the remaining clauses

It contradicts the idea of functional dependencies as "low-overhead
statistics" but maybe it's worth it.

>
> Looking at the ndistinct coefficient 'q', I think it would be better
> if the recorded statistic were just the estimate for
> ndistinct(a,b,...) rather than a ratio of ndistinct values. That's a
> more fundamental statistic, and it's easier to document and easier to
> interpret. Also, I don't believe that the coefficient 'q' is the right
> number to use for clause estimation:
>

IIRC the reason why I stored the coefficients instead of the
ndistinct() values is that the coefficients are not directly related to
the number of rows in the original relation, so you can apply them
directly to whatever cardinality estimate you have.

Otherwise it's mostly the same information - it's trivial to compute one
from the other.
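
Just to illustrate that - with the coefficient being (roughly) the
product of the per-column ndistinct values divided by the ndistinct of
the combination - both forms can be computed from the same counts, e.g.
for (a,b) on your example table:

SELECT count(DISTINCT a) AS nd_a,
       count(DISTINCT b) AS nd_b,
       count(DISTINCT (a, b)) AS nd_ab,
       count(DISTINCT a) * count(DISTINCT b)
           / count(DISTINCT (a, b))::numeric AS q
FROM t;

which should give nd_a = nd_b = nd_ab = 100 and q = 100 = ndistinct(b),
matching the a -> b case you describe.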

>
> Looking at README.ndistinct, I'm skeptical about the selectivity
> estimation argument. In the case where a -> b, you'd have q =
> ndistinct(b), so then P(a=1 & b=2) would become 1/ndistinct(a), which
> is fine for a uniform distribution. But typically, there would be
> univariate statistics on a and b, so if for example a=1 were 100x more
> likely than average, you'd probably know that and the existing code
> computing P(a=1) would reflect that, whereas simply using P(a=1 & b=2)
> = 1/ndistinct(a) would be a significant underestimate, since it would
> be ignoring known information about the distribution of a.
>
> But likewise if, as is later argued, you were to use 'q' as a
> correction factor applied to the individual clause selectivities, you
> could end up with significant overestimates: if you said P(a=1 & b=2)
> = q * P(a=1) * P(b=2), and a=1 were 100x more likely than average, and
> a -> b, then b=2 would also be 100x more likely than average (assuming
> that b=2 was the value implied by the functional dependency), and that
> would also be reflected in the univariate statics on b, so then you'd
> end up with an overall selectivity of around 10000/ndistinct(a), which
> would be 100x too big. In fact, since a -> b means that q =
> ndistinct(b), there's a good chance of hitting data for which q * P(b)
> is greater than 1, so this formula would lead to a combined
> selectivity greater than P(a), which is obviously nonsense.

Well, yeah. The

P(a=1) = 1/ndistinct(a)

was really just a simplification for the uniform distribution, and
looking at "q" as a correction factor is much more practical - no doubt
about that.
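
To plug concrete numbers into your overestimate example: say
ndistinct(a) = 100000, ndistinct(b) = 1000 and a -> b, so q =
ndistinct(b) = 1000. If a=1 (and the implied b=2) are 100x more common
than average, then P(a=1) = 0.001, P(b=2) = 0.1 and

q * P(a=1) * P(b=2) = 1000 * 0.001 * 0.1 = 0.1

i.e. 100x the true selectivity P(a=1) = 0.001. And indeed q * P(b=2) =
100 > 1, so the combined estimate even exceeds P(a=1) on its own, which
is clearly nonsense.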

As for the overestimates and underestimates - I don't think we can
entirely prevent those. We're essentially replacing one assumption
(AVIA, the attribute value independence assumption) with other
assumptions (homogeneity for ndistinct, compatibility for functional
dependencies), hoping that those assumptions are weaker in some sense.
But there will always be cases that break them.

Unlike the functional dependencies, this "homogeneity" assumption is not
dependent on the queries at all, so it should be possible to verify it
during ANALYZE.

Also, maybe we could/should use the same approach as for functional
dependencies, i.e. try using more detailed statistics first and then
apply ndistinct coefficients only on the remaining clauses?

>
> Having a better estimate for ndistinct(a,b,...) looks very useful by
> itself for GROUP BY estimation, and there may be other places that
> would benefit from it, but I don't think it's the best statistic for
> determining functional dependence or combining clause selectivities.
>

Not sure. I think it may be a very useful type of statistics, but I'm not
going to fight for this very hard. I'm fine with ignoring this
statistics type for now, getting the other "detailed" statistics types
(MCV, histograms) in and then revisiting this.
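
That said, the GROUP BY improvement is easy to demonstrate even on your
example table - IIRC the planner currently just multiplies the
per-column ndistinct estimates and clamps the result to the table size,
so (assuming the table is analyzed)

EXPLAIN SELECT a, b FROM t GROUP BY a, b;

gets estimated at ~1000 rows (100 * 100, clamped to the 1000 rows in
the table), while the actual number of groups is 100. A multivariate
ndistinct estimate on (a,b) would get that right.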

> That's as much as I've looked at so far. It's such a big patch that
> it's difficult to consider all at once. I think perhaps the smallest
> committable self-contained unit providing a tangible benefit would be
> something containing the core infrastructure plus the ndistinct
> estimate and the improved GROUP BY estimation.
>

FWIW I find the ndistinct statistics rather uninteresting (at least
compared to the other types of statistics), which is why it's the last
patch in the series. Perhaps I shouldn't have included it at all, as
it's just a distraction.

regards
Tomas
