Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Mark Dilger <hornschnorter(at)gmail(dot)com>, Adrien Nayrat <adrien(dot)nayrat(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2018-03-29 13:36:09
Message-ID: 7d12b5d1-ac77-67e3-c67e-7fcbd3753e9c@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 03/29/2018 02:27 AM, Dean Rasheed wrote:
> On 28 March 2018 at 15:50, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> After thinking about this a bit more, I'm not sure if updating the info
>> based on recursive calls makes sense. The fullmatch flag was supposed to
>> answer a simple question - can there be just a single matching item?
>>
>> If there are equality conditions on all columns, there can be just a
>> single matching item - if we have found it in the MCV (i.e. s1 > 0.0),
>> then we don't need to inspect the non-MCV part.
>>
>> But handling this in recursive manner breaks this entirely, because with
>> something like
>>
>> (a=1) AND (b=1 OR b=2)
>>
>> you suddenly can have multiple matching items. Which makes the fullmatch
>> flag somewhat useless.
>>
>> So I think we should be looking at top-level equality clauses only, just
>> like number_of_groups() does.
>>
>
> I'm not quite sure what you mean by that, but it sounds a bit limiting
> in terms of the kinds of user queries that would be supported.
>

Let me explain. The question is "Can there be just a single combination
of values matching the conditions?" which (if true) allows us to produce
better estimates. If we found a match in the MCV, we don't need to look
at the non-MCV part. If not found in the MCV, we can compute an average
selectivity as 1/ndistinct (possibly using the ndistinct coefficients).

If we can't deduce the existence of a single possible match, we have to
compute an estimate in a more generic way.

With (a=1 AND b=1) and stats on (a,b) there's just a single possible
match (1,1), so that's fine. But it does not work once we start looking
for equalities nested deeper - for example (a=1 AND (b=1 OR b=2)) can be
translated as ((a=1 AND b=1) OR (a=1 AND b=2)) so technically there's an
equality on each column, but there are two possible matches (1,1) and
(1,2). So the optimization does not work.

Does that clarify what I meant?

Although, perhaps we could improve this by deducing number of possible
matches and then track matching items in the MCV list. But that seems
quite a bit harder.

(Of course, we need to consider the non-equality clauses in both cases,
the WIP patch does not do that yet.)

>
>> I think we can remove the fullmatch flag from mcv_update_bitmap
>> entirely. All we need to know is the presence of equality clauses and
>> whether there was a match in MCV (which we know from s1 > 0.0).
>>
>
> I agree with removing the fullmatch flag, but I don't think we
> actually need to know about the presence of equality clauses:
>
> The way that mcv_update_bitmap() recursively computes the set of
> matching MCVs seems to be correct. That gives us a value (call it
> mcv_matchsel) for the proportion of the table's rows that are in the
> MCV list and satisfy the clauses in stat_clauses.
>

Sure, but the extra bit of information allows us to (a) ignore the
non-MCV part and (b) apply the 1/ndistinct estimate.

> We can also estimate that there are (1-mcv_totalsel)*N rows that are
> not in the MCV list, for which the MCV stats therefore tell us
> nothing. The best way to estimate those rows would seem to be to use
> the logic from the guts of clauselist_selectivity(), without
> consulting any extended MCV stats (but still consulting other extended
> stats, I think). Doing that would return a selectivity value (call it
> nonmcv_sel) for those remaining rows. Then a reasonable estimate for
> the overall selectivity would seem to be
>
> mcv_matchsel + (1-mcv_totalsel) * nonmcv_sel
>
> and there would be no need for mcv_update_bitmap() to track eqmatches
> or return fullmatch, and it wouldn't actually matter whether or not we
> had equality clauses or if all the MCV columns were used.
>

Right, although I'm not sure about fallback to clauselist_selectivity()
which kinda throws away the statistical dependency.

That's why I think we should use 1/ndistinct for equality clauses, and
then perhaps leverage the MCV for non-equality clauses somehow.

It just occurred we can apply the 1/ndistinct estimate for equalities
even when we it's not a 'fullmatch'.

So what I propose is roughly this

1) compute selectivity "mcv_sel" using MCV

2) see if there can be just a single match, and (mcv_sel > 0) - if yes,
we're done and we don't need to look at non-MCV part

3) split the clauses into top-level equality clauses and the rest

4) estimate "equal_sel" for equality clauses using 1/ndistinct

5) estimate the "inequal_sel" for remaining clauses using MCV (assumes
the selectivity will be the same on non-MCV part)

6) total selectivity is

mcv_sel + (1 - mcv_totalsel) * equal_sel * inequal_sel

We may need to fall back to clauselist_selectivity() in some cases, of
course, but I think we should leverage the MCV as much as possible.

Another thing is that some of this will change once the histograms are
considered, which helps with estimating the non-MCV part.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2018-03-29 13:40:03 Re: Function to track shmem reinit time
Previous Message Teodor Sigaev 2018-03-29 13:35:39 Re: Cast jsonb to numeric, int, float, bool