Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Mark Dilger <hornschnorter(at)gmail(dot)com>, Adrien Nayrat <adrien(dot)nayrat(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2018-07-16 12:23:52
Message-ID: f0625b09-bf50-5703-6241-51aea96a3f6a@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 07/15/2018 11:36 AM, Dean Rasheed wrote:
> On 13 July 2018 at 18:27, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> I'm not so sure. The issue is that a lot of the MCV deductions depends
>> on whether we can answer questions like "Is there a single match?" or
>> "If we got a match in MCV, do we need to look at the non-MCV part?" This
>> is not very different from the single-column estimates, except of course
>> here we need to look at multiple columns.
>>
>> The top-level clauses allow us to make such deductions, with deeper
>> clauses it's much more difficult (perhaps impossible). Because for
>> example with (a=1 AND b=1) there can be just a single match, so if we
>> find it in MCV we're done. With clauses like ((a=1 OR a=2) AND (b=1 OR
>> b=2)) it's not that simple, because there may be multiple combinations
>> and so a match in MCV does not guarantee anything.
>
> Actually, it guarantees a lower bound on the overall selectivity, and
> maybe that's the best that we can do in the absence of any other
> stats.
>

Hmmm, is that actually true? Let's consider a simple example, with two
columns, each with just 2 values, and a "perfect" MCV list:

a | b | frequency
-------------------
1 | 1 | 0.5
2 | 2 | 0.5

And let's estimate sel(a=1 & b=2). Your proposed algorithm does this:

1) sel(a=1) = 0.5
2) sel(b=2) = 0.5
3) total_sel = sel(a=1) * sel(b=2) = 0.25
4) mcv_sel = 0.0
5) total_sel = Max(total_sel, mcv_sel) = 0.25

How is that a lower bound? Or what is it lower than?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2018-07-16 12:28:54 Re: Parallel queries in single transaction
Previous Message Tomas Vondra 2018-07-16 12:17:49 Re: [HACKERS] PATCH: multivariate histograms and MCV lists