Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Mark Dilger <hornschnorter(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2018-07-16 12:17:49
Message-ID: 0a2de724-ef48-b0b6-d2dd-7a6bde31f970@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 07/16/2018 12:16 AM, Tomas Vondra wrote:
>
> On 07/15/2018 04:43 PM, Dean Rasheed wrote:
>> On 15 July 2018 at 14:29, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>> It's quite unclear to me how this algorithm could reliably end up with
>>> hist_sel=0 (in cases where we already don't end up with that). I mean,
>>> if a bucket matches the conditions, then the only way to eliminate is by
>>> deducing that MCV already contains all the matches - and that's rather
>>> difficult for complex clauses ...
>>
>> Ah, I didn't realise that you were using histograms for equality
>> clauses as well. I had assumed that they would only use the MCV stats,
>> as in the univariate case. Using histograms for equality seems
>> problematic -- if bucket_contains_value() returns STATS_MATCH_PARTIAL,
>> as things stand that would end up with an estimate of half the
>> bucket's frequency, which seems excessive.
>
> Yeah, I think you're right - this is likely to produce over-estimates
> and needs rethinking. With top-level equality clauses it should be
> fairly trivial to use approach similar to the univariate case, i.e.
> computing ndistinct and using
>
> (1 - mcv_totalsel) / ndistinct
>
> If there are ndistinct coefficients this might be pretty good estimate
> of the non-MCV part, I think. But it only works for top-level
> equalities, not for complex clauses as in your examples.
>

On further thought, it's a bit more complicated, actually. Firstly, we
already do that when there's no histogram (as in your example), and
clearly it does not help. I initially thought it's a mistake to use the
histogram in this case, but I can think of cases where it helps a lot.

1) when the equality clauses match nothing

In this case we may not find any buckets possibly matching the
combination of values, producing selectivity estimate 0.0. While by
using 1/ndistinct would give us something else.

2) when there are equality and inequality clauses

Similarly to the previous case, the equality clauses are useful in
eliminating some of the buckets.

Now, I agree estimating equality clauses using histogram is tricky, so
perhaps what we should do is using them as "conditions" to eliminate
histogram buckets, but use ndistinct to estimate the selectivity. That
is something like this:

P(a=1 & b=1 & c<10 & d>=100)
= P(a=1 & b=1) * P(c<10 & d>=100 | a=1 & b=1)
= 1/ndistinct(a,b) * P(c<10 & d>=100 | a=1 & b=1)

where the second part is estimated using the histogram.

Of course, this still only works for the top-level equality clauses :-(

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2018-07-16 12:23:52 Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Previous Message Bruce Momjian 2018-07-16 12:13:21 Re: Finding database for pg_upgrade missing library