Re: multivariate statistics (v19)

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, David Steele <david(at)pgmasters(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Álvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multivariate statistics (v19)
Date: 2017-01-30 19:33:26
Message-ID: 026e37cf-319a-fa6f-79c2-d5a2fc98b1b6@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 01/26/2017 10:43 AM, Dilip Kumar wrote:
>
> histograms
> --------------
> + if (matches[i] == MVSTATS_MATCH_FULL)
> + s += mvhist->buckets[i]->ntuples;
> + else if (matches[i] == MVSTATS_MATCH_PARTIAL)
> + s += 0.5 * mvhist->buckets[i]->ntuples;
>
> Isn't it will be better that take some percentage of the bucket based
> on the number of distinct element for partial matching buckets.
>

I don't think so, for the same reason why ineq_histogram_selectivity()
in selfuncs.c uses

binfrac = 0.5;

for partial bucket matches - it provides minimum average error. Even if
we knew the number of distinct items in the bucket, we have no idea what
the distribution within the bucket looks like. Maybe 99% of the bucket
are covered by a single distinct value, maybe all the items are squashed
on one side of the bucket, etc.

Moreover we don't really know the number of distinct values in the
bucket - we only know the number of distinct items in the sample, and
only while building the histogram. I don't think it makes much sense to
estimate the number of distinct items in a bucket, because the buckets
contain only very few rows so the estimates would be wildly inaccurate.

>
> +static int
> +update_match_bitmap_histogram(PlannerInfo *root, List *clauses,
> + int2vector *stakeys,
> + MVSerializedHistogram mvhist,
> + int nmatches, char *matches,
> + bool is_or)
> +{
> + int i;
>
> For each clause we are processing all the buckets, can't we use some
> data structure which can make multi-dimensions information searching
> faster.
>

No, we're not processing all buckets for each clause. We're' only
processing buckets that were not "ruled out" by preceding clauses.
That's the whole point of the bitmap.

For example for condition (a=1) AND (b=2), the code will first evaluate
(a=1) on all buckets, and then (b=2) but only on buckets where (a=1) was
evaluated as true. Similarly for OR clauses.

>
> Something like HTree, RTree, Maybe storing histogram in these formats
> will be difficult?
>

Maybe, but I don't want to do that in the first version. I'm not opposed
to doing that in the future, if we find out the v1 histograms are not
efficient (I don't think we will, based on tests I did while working on
the patch). Support for other histogram implementations is pretty much
why there is 'type' field in the struct.

For now I think we should stick with the simple implementation.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2017-01-30 19:36:12 Re: multivariate statistics (v19)
Previous Message Stephen Frost 2017-01-30 19:31:08 Re: [COMMITTERS] pgsql: test_pg_dump TAP test whitespace cleanup