Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Mark Dilger <hornschnorter(at)gmail(dot)com>, Adrien Nayrat <adrien(dot)nayrat(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2018-07-15 22:16:57
Message-ID: 5721b874-35f7-89aa-6c17-4f9a0e8063d0@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


On 07/15/2018 04:43 PM, Dean Rasheed wrote:
> On 15 July 2018 at 14:29, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> It's quite unclear to me how this algorithm could reliably end up with
>> hist_sel=0 (in cases where we already don't end up with that). I mean,
>> if a bucket matches the conditions, then the only way to eliminate is by
>> deducing that MCV already contains all the matches - and that's rather
>> difficult for complex clauses ...
>
> Ah, I didn't realise that you were using histograms for equality
> clauses as well. I had assumed that they would only use the MCV stats,
> as in the univariate case. Using histograms for equality seems
> problematic -- if bucket_contains_value() returns STATS_MATCH_PARTIAL,
> as things stand that would end up with an estimate of half the
> bucket's frequency, which seems excessive.

Yeah, I think you're right - this is likely to produce over-estimates
and needs rethinking. With top-level equality clauses it should be
fairly trivial to use approach similar to the univariate case, i.e.
computing ndistinct and using

(1 - mcv_totalsel) / ndistinct

If there are ndistinct coefficients this might be pretty good estimate
of the non-MCV part, I think. But it only works for top-level
equalities, not for complex clauses as in your examples.

While looking at the statext_clauselist_selectivity code I think I see
two more bugs:

1) the histogram_clauselist_selectivity() should probably use
'stat_clauses' and not 'clauses'

2) the early returns fail to estimate the neqclauses

It's a bit too late here but I'll look at it tomorrow.

> Also, if I'm reading it correctly, the code for histograms with
> not-equals will return STATS_MATCH_PARTIAL for all but one of the
> buckets, which isn't great either.
>

Ummm, why?

>
>> I don't know, really. I'll have to try hacking on this a bit I guess.
>> But there's one obvious question - in what order should we add the
>> clauses? Does it matter at all, or what is the optimal order? We don't
>> need to worry about it now, because we simply consider all clauses at
>> once, but I guess the proposed algorithm is more sensitive to this.
>
> I don't know. That's definitely one of the least satisfactory parts of
> that idea.
>
> The alternative seems to be to improve the match tracking in your
> current algorithm so that it keeps more detailed information about the
> kinds of matches seen at each level, and combines them appropriately.
> Maybe that's possible, but I'm struggling to see exactly how. Counting
> equality clauses seen on each column might be a start. But it would
> also need to track inequalities, with min/max values or fractions of
> the non-MCV total, or some such thing.
>

Yeah, I agree, I'm not happy with this part either. But I'm grateful
there's someone else thinking about the issues and proposing alternative
approaches. Thanks for doing that.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2018-07-15 22:48:43 Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"
Previous Message Tom Lane 2018-07-15 20:41:35 Re: Usage of epoch in txid_current