Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Mark Dilger <hornschnorter(at)gmail(dot)com>, Adrien Nayrat <adrien(dot)nayrat(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2019-01-12 07:49:22
Message-ID: CAEZATCWWLvp0V+VmAA02dx1jEcJUO=GOMh3ybHEsvSYmxT62sg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, 11 Jan 2019, 21:18 Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com wrote:

>
> On 1/10/19 4:20 PM, Dean Rasheed wrote:
> > ...
> >
> > So perhaps what we should do for multivariate stats is simply use the
> > relative standard error approach (i.e., reuse the patch in [2] with a
> > 20% RSE cutoff). That had a lot of testing at the time, against a wide
> > range of data distributions, and proved to be very good, not to
> > mention being very simple.
> >
> > That approach would encompass both groups more and less common than
> > the base frequency, because it relies entirely on the group appearing
> > enough times in the sample to infer that any errors on the resulting
> > estimates will be reasonably well controlled. It wouldn't actually
> > look at the base frequency at all in deciding which items to keep.
> >
>
> I've been looking at this approach today, and I'm a bit puzzled. That
> patch essentially uses SRE to compute mincount like this:
>
> mincount = n*(N-n) / (N-n+0.04*n*(N-1))
>
> and then includes all items more common than this threshold.

Right.

How could
> that handle items significantly less common than the base frequency?
>

Well what I meant was that it will *allow* items significantly less common
than the base frequency, because it's not even looking at the base
frequency. For example, if the table size were N=100,000 and we sampled
n=10,000 rows from that, mincount would work out as 22. So it's easy to
construct allowed items more common than that and still significantly less
common than their base frequency.

A possible refinement would be to say that if there are more than
stats_target items more common than this mincount threshold, rather than
excluding the least common ones to get the target number of items, exclude
the ones closest to their base frequencies, on the grounds that those are
the ones for which the MCV stats will make the least difference. That might
complicate the code somewhat though -- I don't have it in front of me, so I
can't remember if it even tracks more than stats_target items.

Regards,
Dean

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2019-01-12 08:49:39 Re: speeding up planning with partitions
Previous Message leif 2019-01-12 07:40:07 Re: BUG #15589: Due to missing wal, restore ends prematurely and opens database for read/write