Re: multivariate statistics (v19)

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: Tatsuo Ishii <ishii(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, david(at)pgmasters(dot)net, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, jeff(dot)janes(at)gmail(dot)com, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multivariate statistics (v19)
Date: 2016-08-10 18:09:14
Message-ID: 53953c79-0603-003d-134e-cc4f0fd1ad64@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 08/10/2016 02:24 PM, Michael Paquier wrote:
> On Wed, Aug 10, 2016 at 8:50 PM, Petr Jelinek <petr(at)2ndquadrant(dot)com> wrote:
>> On 10/08/16 13:33, Tomas Vondra wrote:
>>>
>>> On 08/10/2016 06:41 AM, Michael Paquier wrote:
>>>>
>>>> On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra
>>>>>
>>>>> 2) combining multiple statistics
>>>>>
>>>>>
>>>>> I think the ability to combine multivariate statistics (covering
>>>>> different
>>>>> subsets of conditions) is important and useful, but I'm starting to
>>>>> think
>>>>> that the current implementation may not be the correct one (which is
>>>>> why I
>>>>> haven't written the SGML docs about this part of the patch series yet).
>>>>>
>>>>> Assume there's a table "t" with 3 columns (a, b, c), and that we're
>>>>> estimating query:
>>>>>
>>>>> SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3
>>>>>
>>>>> but that we only have two statistics (a,b) and (b,c). The current
>>>>> patch does
>>>>> about this:
>>>>>
>>>>> P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2)
>>>>>
>>>>> i.e. it estimates the first two conditions using (a,b), and then
>>>>> estimates
>>>>> (c=3) using (b,c) with "b=2" as a condition. Now, this is very
>>>>> efficient,
>>>>> but it only works as long as the query contains conditions
>>>>> "connecting" the
>>>>> two statistics. So if we remove the "b=2" condition from the query, this
>>>>> stops working.
>>>>
>>>>
>>>> This is trying to make the algorithm smarter than the user, which is
>>>> something I'd think we could live without. In this case statistics on
>>>> (a,c) or (a,b,c) are missing. And what if the user does not want to
>>>> make use of stats for (a,c) because he only defined (a,b) and (b,c)?
>>>>
>>>
>>> I don't think so. Obviously, if you have statistics covering all the
>>> conditions - great, we can't really do better than that.
>>>
>>> But there's a crucial relation between the number of dimensions of the
>>> statistics and accuracy of the statistics. Let's say you have statistics
>>> on 8 columns, and you split each dimension twice to build a histogram -
>>> that's 256 buckets right there, and we only get ~50% selectivity in each
>>> dimension (the actual histogram building algorithm is more complex, but
>>> you get the idea).
>>
>> I think it makes sense to pursue this, but I also think we can easily live
>> with not having it in the first version that gets committed and doing it as
>> follow-up patch.
>
> This patch is large and complicated enough. As this is not a mandatory
> piece to get a basic support, I'd suggest just to drop that for later.

Which is why combining multiple statistics is in part 0006 and all the
previous parts simply choose the single "best" statistics ;-)

I'm perfectly fine with committing just the first few parts, and leaving
0006 for the next major version.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2016-08-10 18:22:52 Re: Declarative partitioning - another take
Previous Message Tomas Vondra 2016-08-10 18:07:23 Re: multivariate statistics (v19)