Re: multivariate statistics v14

From: Tatsuo Ishii <ishii(at)postgresql(dot)org>
To: tomas(dot)vondra(at)2ndquadrant(dot)com
Cc: ishii(at)postgresql(dot)org, jeff(dot)janes(at)gmail(dot)com, alvherre(at)2ndquadrant(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: multivariate statistics v14
Date: 2016-03-22 12:46:15
Message-ID: 20160322.214615.1524680563827969904.t-ishii@sraoss.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On 03/22/2016 11:41 AM, Tatsuo Ishii wrote:
>>>> Hum. So without 0006 or beyond, there's not much benefit for the
>>>> PostgreSQL users, and you are not too confident about 0006 or
>>>> beyond. Then I would think it is a little bit hard to justify in
>>>> putting 000[2-5] into 9.6. I really like this feature and would
>>>> like to see in PostgreSQL someday, but I'm not sure if we should
>>>> put the patches (0002-0005) into PostgreSQL now. Please let me
>>>> know if there's some reaons we should put the patches into
>>>> PostgreSQL now.
>>>
>>> I don't think so. While being able to combine multiple statistics
>>> is certainly useful, I'm convinced that the initial patched add
>>> enough
>>
>> Can you please elaborate a little bit more how combining multiple
>> statistics is useful?
>
> Sure.
>
> The goal of multivariate statistics is to approximate a probability
> distribution on a group of columns. The larger the number of columns,
> the less accurate the statistics will be (with respect to individual
> columns), assuming fixed size of the sample in ANALYZE, and fixed
> statistics size.
>
> For example, if you add a column to multivariate histogram, you'll do
> some "bucket splits" by this dimension, thus reducing the accuracy for
> the other columns. You may of course allow larger statistics
> (e.g. histograms with more buckets), but that also requires larger
> samples, and so on.
>
> Now, let's assume you have a query like this:
>
> WHERE (a=1) AND (b=2) AND (c=3) AND (d=4)
>
> and that "a" and "b" are correlated, and "c" and "d" are correlated,
> but that otherwise the columns are independent. It'd be a bit silly to
> require building statistics on (a,b,c,d), when two statistics on each
> of the column pairs would be cheaper and also more accurate.
>
> That's of course a trivial case - independent groups of correlated
> columns. But I'd say this is actually a pretty common case, and I do
> believe there's not much controversy that we should support it.
>
> Another reason to allow multiple statistics is that columns in one
> group may be a good fit for MCV list (which works well for discrete
> values), while the other group may be a good candidate for histogram
> (which works well for continuous values). This can't be solved by
> first building a MCV and then a histogram on the group.
>
> The question of course is what to do if the groups are not
> independent. The patch does that by assuming the statistics overlap,
> and uses conditions on the columns included in both statistics to
> combine them using conditional probabilities. I do believe this works
> quite well, but this is perhaps the part that needs further
> discussion. There are other ways to combine the statistics, but I do
> expect them to be considerably more expensive.
>
> Is this a sufficient explanation?
>
> Of course, there's a fair amount of additional complexity that I have
> not mentioned here (e.g. selecting the right combination of stats).

Sorry, maybe I did not explain clearyly. My question is, if put
patches only 0002 to 0005 into 9.6, does it still give any visible
benefit to users?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2016-03-22 12:49:48 Re: Speed up Clog Access by increasing CLOG buffers
Previous Message Tomas Vondra 2016-03-22 12:37:24 Re: multivariate statistics v14