Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Mark Dilger <hornschnorter(at)gmail(dot)com>, Adrien Nayrat <adrien(dot)nayrat(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2018-07-15 14:43:36
Message-ID: CAEZATCXwTYmw5dBhuShh-1kmDV3UYSbf0=+qweXH5Y3=_m1WiA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 15 July 2018 at 14:29, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> It's quite unclear to me how this algorithm could reliably end up with
> hist_sel=0 (in cases where we already don't end up with that). I mean,
> if a bucket matches the conditions, then the only way to eliminate is by
> deducing that MCV already contains all the matches - and that's rather
> difficult for complex clauses ...

Ah, I didn't realise that you were using histograms for equality
clauses as well. I had assumed that they would only use the MCV stats,
as in the univariate case. Using histograms for equality seems
problematic -- if bucket_contains_value() returns STATS_MATCH_PARTIAL,
as things stand that would end up with an estimate of half the
bucket's frequency, which seems excessive. Also, if I'm reading it
correctly, the code for histograms with not-equals will return
STATS_MATCH_PARTIAL for all but one of the buckets, which isn't great
either.

> I don't know, really. I'll have to try hacking on this a bit I guess.
> But there's one obvious question - in what order should we add the
> clauses? Does it matter at all, or what is the optimal order? We don't
> need to worry about it now, because we simply consider all clauses at
> once, but I guess the proposed algorithm is more sensitive to this.

I don't know. That's definitely one of the least satisfactory parts of
that idea.

The alternative seems to be to improve the match tracking in your
current algorithm so that it keeps more detailed information about the
kinds of matches seen at each level, and combines them appropriately.
Maybe that's possible, but I'm struggling to see exactly how. Counting
equality clauses seen on each column might be a start. But it would
also need to track inequalities, with min/max values or fractions of
the non-MCV total, or some such thing.

Regards,
Dean

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2018-07-15 17:02:22 Re: why partition pruning doesn't work?
Previous Message David Rowley 2018-07-15 13:42:27 Re: [HACKERS] Removing LEFT JOINs in more cases