Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Mark Dilger <hornschnorter(at)gmail(dot)com>, Adrien Nayrat <adrien(dot)nayrat(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2019-01-13 00:04:00
Message-ID: e70e4642-9aca-12fe-c1d6-f3d90211bdbb@2ndquadrant.com
Lists: pgsql-hackers

On 1/12/19 8:49 AM, Dean Rasheed wrote:
> On Fri, 11 Jan 2019, 21:18 Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
>
> On 1/10/19 4:20 PM, Dean Rasheed wrote:
> > ...
> >
> > So perhaps what we should do for multivariate stats is simply use the
> > relative standard error approach (i.e., reuse the patch in [2] with a
> > 20% RSE cutoff). That had a lot of testing at the time, against a wide
> > range of data distributions, and proved to be very good, not to
> > mention being very simple.
> >
> > That approach would encompass both groups more and less common than
> > the base frequency, because it relies entirely on the group appearing
> > enough times in the sample to infer that any errors on the resulting
> > estimates will be reasonably well controlled. It wouldn't actually
> > look at the base frequency at all in deciding which items to keep.
> >
>
> I've been looking at this approach today, and I'm a bit puzzled. That
> patch essentially uses SRE to compute mincount like this:
>
>     mincount = n*(N-n) / (N-n+0.04*n*(N-1))
>
> and then includes all items more common than this threshold.
>
>
> Right.
>
> How could that handle items significantly less common than the base
> frequency?
>
>
> Well, what I meant was that it will *allow* items significantly less
> common than the base frequency, because it's not even looking at the
> base frequency. For example, if the table size were N=100,000 and we
> sampled n=10,000 rows from that, mincount would work out as 22. So it's
> easy to construct items that clear that threshold while still being
> significantly less common than their base frequency.
>

OK, understood. I agree that's a sensible yet simple approach, so I've
adopted it in the next version of the patch.
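
For reference, here's a minimal standalone sketch of that cutoff (my
own illustration, not code from either patch), with the derivation from
the 20% RSE target in the comment; rse_mincount is a hypothetical name:

/*
 * Illustrative sketch (not the actual patch code) of the RSE-based
 * cutoff.  A group seen cnt times in a sample of n rows drawn without
 * replacement from N rows has estimated frequency p = cnt/n with
 * standard error sqrt(p*(1-p)/n * (N-n)/(N-1)).  Requiring
 * SE/p <= 0.2 (i.e. RSE^2 <= 0.04) and solving for cnt gives the
 * mincount formula quoted above.
 */
#include <stdio.h>

static double
rse_mincount(double n, double N)
{
    return n * (N - n) / (N - n + 0.04 * n * (N - 1));
}

int
main(void)
{
    /* Dean's example: N = 100,000 table rows, n = 10,000 sampled */
    printf("mincount = %.2f\n", rse_mincount(10000, 100000));
    /* prints 22.45, i.e. the ~22 from the example above */
    return 0;
}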

> A possible refinement would be to say that if there are more than
> stats_target items more common than this mincount threshold, rather than
> excluding the least common ones to get the target number of items,
> exclude the ones closest to their base frequencies, on the grounds that
> those are the ones for which the MCV stats will make the least
> difference. That might complicate the code somewhat though -- I don't
> have it in front of me, so I can't remember if it even tracks more than
> stats_target items.
>

Yes, the patch does limit the number of items to stats_target (the
maximum of the per-attribute stattarget values, to be precise). IIRC
that's a piece you added sometime last year ;-)
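
To make the current behavior concrete, here's a hypothetical sketch
(McvItem and cap_mcv_items are my own names, not from the patch): after
the mincount filter, sort by sample count and keep at most stattarget
of the most common items. Dean's refinement would instead drop the
items closest to their base frequency.

/*
 * Hypothetical sketch of capping the MCV list at the statistics
 * target by keeping the most common items; not the actual patch code.
 */
#include <stdlib.h>

typedef struct McvItem
{
    double      count;      /* occurrences of the group in the sample */
    double      base_freq;  /* product of per-column frequencies */
} McvItem;

/* qsort comparator: descending by sample count */
static int
cmp_count_desc(const void *a, const void *b)
{
    const McvItem *ia = (const McvItem *) a;
    const McvItem *ib = (const McvItem *) b;

    if (ia->count > ib->count)
        return -1;
    if (ia->count < ib->count)
        return 1;
    return 0;
}

/* returns the number of items to keep in the MCV list */
static int
cap_mcv_items(McvItem *items, int nitems, int stattarget)
{
    qsort(items, nitems, sizeof(McvItem), cmp_count_desc);
    return (nitems > stattarget) ? stattarget : nitems;
}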

I've been experimenting with removing the items closest to their base
frequencies today, and I came to the conclusion that it's rather
tricky, for a few reasons.

1) How exactly do you measure "closeness" to the base frequency? I've
tried computing the error in different ways, including:

* Max(freq/base, base/freq)
* abs(freq - base)

but the choice does not seem to affect the behavior very much, TBH.
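
Spelled out as code (error_ratio and error_abs are hypothetical helper
names, just to make the two definitions explicit):

/*
 * The two "closeness" measures tried above.  Larger values mean the
 * sample frequency deviates more from the base frequency, so pruning
 * would drop the items with the smallest values first.
 */
#include <math.h>

/* multiplicative error: Max(freq/base, base/freq), always >= 1.0 */
static double
error_ratio(double freq, double base)
{
    return (freq > base) ? (freq / base) : (base / freq);
}

/* additive error: abs(freq - base) */
static double
error_abs(double freq, double base)
{
    return fabs(freq - base);
}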

2) It necessarily reduces mcv_totalsel, i.e. it increases the part of
the data not covered by the MCV list, and estimates for that uncovered
part are rather crude.

3) It does nothing for "impossible" items, i.e. combinations that do
not exist at all. Clearly, those won't appear in the sample, and so
can't be included in the MCV list no matter which error definition we
pick. And for very rare combinations it might lead to sudden changes in
estimates, depending on whether or not the group happens to be sampled.

So IMHO it's better to stick to the simple SRE approach for now.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
