
Re: Collect frequency statistics for arrays

From: Noah Misch <noah(at)leadboat(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Nathan Boley <npboley(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Collect frequency statistics for arrays
Date: 2012-01-12 11:06:27
Message-ID: 20120112110627.GB4923@tornado.leadboat.com
Lists: pgsql-hackers
On Sat, Jan 07, 2012 at 09:36:42PM +0400, Alexander Korotkov wrote:
> Patch where most part of issues are fixed is attached.

Thanks.  I've made several, largely cosmetic, edits.  See attached version
0.10.  Please use it as the basis for your next version, and feel free to
revert any changes you deem inappropriate.  Where I made non-cosmetic edits, I
attempted to point that out below.  I've left unfixed a few more-substantive
problems, also described below.

When you post another update, could you add it to the open CF?  Given the
timing, I think we might as well consider any further activity to have
happened under the aegis of the 2012-01 CF.  I'm marking the current entry
Returned with Feedback.

> On Thu, Dec 29, 2011 at 8:35 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> > I find distressing the thought of having two copies of the lossy sampling
> > code, each implementing the algorithm with different variable names and
> > levels of generality.  We might someday extend this to hstore, and then
> > we'd have yet another copy.  Tom commented[1] that ts_typanalyze() and
> > array_typanalyze() should remain distinct, and I agree.  However, they
> > could call a shared counting module.  Is that practical?  Possible API:
> >
> > typedef struct LossyCountCtl;
> > LossyCountCtl *LossyCountStart(float s,
> >                                float epsilon,
> >                                int2 typlen,
> >                                bool typbyval,
> >                                Oid eqfunc); /* + hash func, a few others */
> > void LossyCountAdd(LossyCountCtl *ctl, Datum elem);
> > TrackItem **LossyCountGetAll(LossyCountCtl *ctl);
> >
> > [1] http://archives.postgresql.org/message-id/12406.1298055475@sss.pgh.pa.us
> 
> I'm not sure about a shared lossy counting module, because the shared
> code would be relatively small. The part of the compute_array_stats
> function that takes care of array decompression, distinct occurrence
> calculation, the distinct element count histogram, packing statistics
> slots, etc. is much larger than the lossy counting algorithm itself.
> Maybe there are other opinions in the community?

True; it would probably increase total lines of code.  The benefit, if any,
lies in separation of concerns; the business of implementing this algorithm is
quite different from the other roles of these typanalyze functions.  I won't
insist that you try it, though.

> > I think this is an improvement, but some code out there may rely on the
> > ability to get stakind = 4 data from the most_common_vals column.  We'll
> > need to mention this in the release notes as an incompatibility.
> 
> I'm not sure I understand the mechanism of release notes. Does it require
> something in the patch itself?

No.  I just wanted to call attention to the fact in the hope that someone
remembers as the release notes get drafted.

> > > +             /*
> > > +              * The probability of no occurence of events which forms "rest"
> > > +              * probability have a limit of exp(-rest) when number of events fo to
> > > +              * infinity. Another simplification is to replace that events with one
> > > +              * event with (1 - exp(-rest)) probability.
> > > +              */
> > > +             rest = 1.0f - exp(-rest);
> >
> > What is the name of the underlying concept in probability theory?
> 
> The closest concept to the calculated distribution is the multinomial
> distribution. But it's not exactly the same, because the multinomial
> distribution gives the probability of a particular count of each event
> occurrence, not the probability of summary occurrence. Actually, the
> distribution is calculated just from the assumption of event independence.
> The closest concept for the rest probability is approximation by an
> exponential distribution. It's quite a rough approximation, but I can't
> invent something better with low calculation complexity.

Do you have a URL of a tutorial or paper that explains the method in more
detail?  If, rather, this is a novel synthesis, could you write a proof to
include in the comments?

> > > + /*
> > > +  * Array selectivity estimation based on most common elements statistics for
> > > +  * "column <@ const" case. Assumption that element occurences are independent
> > > +  * means certain distribution of array lengths. Typically real distribution
> > > +  * of lengths is significantly different from it. For example, if even we
> > > +  * have set of arrays with 1 integer element in range [0;10] each, element
> > > +  * occurences are not independent. Because in the case of independence we
> >
> > Do you refer to a column where '{1,12,46}' and '{13,7}' may appear, but
> > '{6,19,4}' cannot appear?
> 
> I refer to a column where only one element exists, i.e. the only possible
> values are '{0}', '{1}', '{2}', '{3}', '{4}', '{5}', '{6}', '{7}', '{8}',
> '{9}', '{10}'. That is a corner case. But a similar situation occurs when,
> for example, we have a distribution of distinct element counts between 1
> and 3. It significantly differs from the distribution under independent
> occurrence.

Oh, I think I see now.  If each element 1..10 had frequency 0.1 independently,
column values would have exactly one distinct element just 39% of the time?

If probability theory has a prototypical problem resembling this, it would be
nice to include a URL to a thorough discussion thereof.  I could not think of
the search terms to find one, though.

> > > +  * have probabilities of length of 0, 1, 2 etc. In the "column @> const"
> > > +  * and "column && const" cases we usually have "const" with low summary
> > > +  * frequency of elements (otherwise we have selectivity close to 0 or 1
> > > +  * correspondingly). That's why effect of dependence related to lengths
> > > +  * distribution id negligible there. In the "column <@ const" case summary
> > > +  * frequency of elements is high (otherwise we have selectivity close to 0).
> >
> > What does the term "summary frequency" denote?
> 
> I meant the sum of the frequencies of the "const" array elements.

Do you mean literally P_0 + P_1 ... + P_N?  If so, I can follow the above
argument for "column && const" and "column <@ const", but not for "column @>
const".  For "column @> const", selectivity cannot exceed the smallest
frequency among elements of "const".  Several high-frequency elements together
will drive up the sum of frequencies without increasing the true selectivity.

> > > +     /*
> > > +      * Rest is a average length of elements which aren't present in mcelem.
> > > +      */
> > > +     rest = avg_length;
> >
> > You define "rest" here as an array length ...

> > > +                             rest -= numbers[mcelem_index];
> >
> > ... But here, you're subtracting a frequency from an array length?
> >
> 
> Yes, because the average distinct element count is the sum of the
> frequencies of the elements. Subtracting mcelem frequencies from
> avg_length gives the sum of frequencies of non-mcelem elements.

I see now; thanks.  I updated the comments so that this would have been
clearer to me.

> *** /dev/null
> --- b/src/backend/utils/adt/array_selfuncs.c

> + Selectivity
> + calc_scalararraysel(VariableStatData *vardata, Datum constval, bool orClause,
> + 					Oid operator)
> + {
> + 	Oid			elemtype;
> + 	Selectivity selec;
> + 	TypeCacheEntry *typentry;
> + 	Datum	   *hist;
> + 	int			nhist;
> + 	FunctionCallInfoData cmpfunc;
> + 
> + 	elemtype = get_base_element_type(vardata->vartype);
> + 
> + 
> + 	/* Get default comparison function */
> + 	typentry = lookup_type_cache(elemtype,
> + 		   TYPECACHE_CMP_PROC | TYPECACHE_CMP_PROC_FINFO | TYPECACHE_EQ_OPR);
> + 
> + 	/* Handle only "=" operator. Return default selectivity in other cases. */
> + 	if (operator != typentry->eq_opr)
> + 		return (Selectivity) 0.5;

Punting on other operators this way creates a plan quality regression for
operations like "const < ANY (column)".  Please do it some way that falls
back on the somewhat-better existing scalararraysel() treatment for this.

> + 
> + 	/* Without comparison function return default selectivity estimation */
> + 	if (!OidIsValid(typentry->cmp_proc))
> + 	{
> + 		if (orClause)
> + 			return DEFAULT_OVERLAP_SEL;
> + 		else
> + 			return DEFAULT_CONTAIN_SEL;
> + 	}

Since "const = ANY (column)" is equivalent to "column @> array[const]" and
"const = ALL (column)" is equivalent to "column <@ array[const]",
DEFAULT_CONTAIN_SEL is always correct here.  I've made that change.

> + /*
> +  * Calculate first n distinct element counts probabilities by histogram. We
> +  * assume that any interval between a and b histogram values gives
> +  * 1 / ((b - a + 1) * (nhist - 1)) probability to values between a and b and
> +  * half of that to a and b. Returns total probability that distinct element
> +  * count is less of equal to n.
> +  */
> + static float
> + calc_hist(Datum *hist, int nhist, float *hist_part, int n)

To test this function, I ran the following test case:

set default_statistics_target = 4;
create table t3 as select array(select * from generate_series(1, v)) as arr
from (values (2),(2),(2),(3),(5),(5),(5)) v(v), generate_series(1,100);
analyze t3; -- length_histogram_bounds = {2,2,5,5}
select * from t3 where arr <@ array[6,7,8,9,10,11];

Using gdb to observe calc_hist()'s result during the last command:

(gdb) p calc_hist(hist, nhist, hist_part, unique_nitems)
$23 = 0.666666687
(gdb) x/6f hist_part
0xcd4bc8:       0       0       0.333333343     0
0xcd4bd8:       0       0.333333343

I expected an equal, nonzero probability in hist_part[3] and hist_part[4] and
a total probability of 1.0.

> + {
> + 	int			k,
> + 				i = 0,
> + 				prev_interval = 0,
> + 				next_interval = 0;
> + 	float		frac,
> + 				total = 0.0f;
> + 
> + 	/*
> + 	 * frac is a probability contribution by each interval between histogram
> + 	 * values. We have nhist - 1 intervals. Contribution of one will be 1 /
> + 	 * (nhist - 1).
> + 	 */
> + 	frac = 1.0f / ((float) (nhist - 1));
> + 	for (k = 0; k <= n; k++)
> + 	{
> + 		int			count = 0;
> + 
> + 		/* Count occurences of k distinct element counts in histogram. */
> + 		while (i < nhist && DatumGetInt32(hist[i]) <= k)
> + 		{
> + 			if (DatumGetInt32(hist[i]) == k)
> + 				count++;
> + 			i++;
> + 		}
> + 
> + 		if (count > 0)
> + 		{
> + 			float		val;
> + 
> + 			/* Find length between current histogram value and the next one */
> + 			if (i < nhist)
> + 				next_interval = DatumGetInt32(hist[i + 1]) -

Doesn't this read past the array end when i == nhist - 1?

> + /*
> +  * Let be n independent events with probabilities p. This function calculates
> +  * probabilities of exact k of events occurence for k in [0;m].
> +  * Imagine matrix M of (n + 1) x (m + 1) size. Element M[i,j] denotes
> +  * probability that exact j of first i events occurs. Obviously M[0,0] = 1.
> +  * Each next event increase total number of occured events if it occurs and
> +  * leave last value of that number if it doesn't occur. So, by the law of
> +  * total probability: M[i,j] = M[i - 1, j] * (1 - p[i]) + M[i - 1, j - 1] * p[i]
> +  * for i > 0, j > 0. M[i,0] = M[i - 1, 0] * (1 - p[i]) for i > 0.
> +  * Also there could be some events with low probabilities. Their summary
> +  * probability passed in the rest parameter.
> +  */
> + static float *
> + calc_distr(float *p, int n, int m, float rest)

I attempted to clarify this comment; please see if I preserved its accuracy.

> + 	/*
> + 	 * Using of distinct element counts histogram requires O(nitems * (nmcelem
> + 	 * + nitems)) operations. It's reasonable to limit the number of required
> + 	 * operation and give less accurate answer when this limit exceed.
> + 	 */
> + 	if (nhist > 0 && unique_nitems <=
> + 		300 * default_statistics_target / (nmcelem + unique_nitems))

I benchmarked the quadratic complexity here.  With default settings, this
cutoff skips the algorithm beginning around a 170-element constant array,
which would nonetheless take single-digit milliseconds to plan.  When I
temporarily removed the cutoff, I needed much larger scales to get poor plan
times.  5000 elements took 180 ms to plan, and 10000 elements took 620 ms.
Bottom line, the cutoff you've chosen is plenty conservative.

> + static int
> + element_compare(const void *key1, const void *key2, void *arg)
> + {
> + 	const Datum *d1 = (const Datum *) key1;
> + 	const Datum *d2 = (const Datum *) key2;
> + 	FunctionCallInfo cmpfunc = (FunctionCallInfo) arg;
> + 
> + 	cmpfunc   ->arg[0] = *d1;
> + 	cmpfunc   ->arg[1] = *d2;
> + 	cmpfunc   ->argnull[0] = false;
> + 	cmpfunc   ->argnull[1] = false;
> + 	cmpfunc   ->isnull = false;

This indented poorly due to "cmpfunc" having a place in our typedefs list.  I
changed the identifier.

> *** /dev/null
> --- b/src/backend/utils/adt/array_typanalyze.c

> +  *	We set s to be the estimated frequency of the K'th element in a natural
> +  *	language's frequency table, where K is the target number of entries in
> +  *	the MCELEM array. We assume that the distribution of element frequencies
> +  *	follows Zipf's law with an exponent of 1.
> +  *
> +  *	Assuming Zipfian distribution, the frequency of the K'th element is equal
> +  *	to 1/(K * H(W)) where H(n) is 1/2 + 1/3 + ... + 1/n and W is the number of
> +  *	elements in the language.	Putting W as one million, we get roughly
> +  *	0.07/K. This gives s = 0.07/K.	We set epsilon = s/10, which gives bucket
> +  *	width w = K/0.007 and maximum expected hashtable size of about 1000 * K.

Given the lack of applicability to arrays, I replaced these last two
paragraphs with some weasel words.  My gut feeling is that we're priming the
algorithm to deliver answers far more precise than needed.  However, I haven't
attempted a principled replacement.

> + 	/* This is 'w' from the LC algorithm */
> + 	int			bucket_width;
> + 	int			array_no,
> + 				element_no;

I think it's possible for element_no to overflow.  Consider rows with 2000
distinct elements apiece at a statistics target of 10000 (3M sample rows).
So, I made it a uint64.

> + 	extra_data = (ArrayAnalyzeExtraData *) stats->extra_data;

This still isn't reentrant; you'd need to save the existing static extra_data
and restore it on function exit.  However, it turns out that do_analyze_rel()
itself isn't reentrant on account of its similar management of "anl_context";
any nested ANALYZE crashes the backend.  So, I don't think we need further
change here.  It will be easy to make reentrant later if necessary, though I'd
probably fix do_analyze_rel() by just throwing an error on recursive ANALYZE.

> + 	stats->extra_data = extra_data->std_extra_data;
> + 	old_context = CurrentMemoryContext;
> + 	extra_data->std_compute_stats(stats, fetchfunc, samplerows, totalrows);
> + 	MemoryContextSwitchTo(old_context);

Is the callee known to change CurrentMemoryContext and not restore it?
Offhand, I'm not seeing how it could do so.

> + 	/*
> + 	 * hashtable for arrays distinct element count.
> + 	 */
> + 	MemSet(&count_hash_ctl, 0, sizeof(count_hash_ctl));
> + 	count_hash_ctl.keysize = sizeof(int);
> + 	count_hash_ctl.entrysize = sizeof(DistinctElementCountItem);
> + 	count_hash_ctl.hash = tag_hash;
> + 	count_hash_ctl.match = memcmp;
> + 	count_hash_ctl.hcxt = CurrentMemoryContext;
> + 	count_tab = hash_create("Array distinct element count table",
> + 							64,
> + 							&count_hash_ctl,
> + 					HASH_ELEM | HASH_FUNCTION | HASH_COMPARE | HASH_CONTEXT);

This HASH_COMPARE setting is redundant, so I've removed it.

> + 		/* Skip too large values. */
> + 		if (toast_raw_datum_size(value) > WIDTH_THRESHOLD)

Fixed this warning:

array_typanalyze.c: In function `compute_array_stats':
array_typanalyze.c:361: warning: implicit declaration of function `toast_raw_datum_size'

> + 			continue;
> + 		else
> + 			analyzed_rows++;
> + 
> + 		/*
> + 		 * Add up widths for average-width calculation.  Since it's a array,
> + 		 * we know it's varlena.  As in the regular compute_minimal_stats
> + 		 * function, we use the toasted width for this calculation.
> + 		 */
> + 		total_width += VARSIZE_ANY(DatumGetPointer(value));

Since this is now unused, I removed it.

> + 			/* Lookup current element in hashtable, adding it if new */
> + 			item = (TrackItem *) hash_search(elements_tab,
> + 											 (const void *) &hash_key,
> + 											 HASH_ENTER, &found);
> + 
> + 			if (found)
> + 			{

I added a pfree(hash_key) here.  In one of my default_statistics_target=3000
tests on a table with few possible elements, this saved hundreds of megabytes
of memory.

> + 				int			i;
> + 
> + 				/*
> + 				 * The element is already on the tracking list. Check if it's
> + 				 * first occurence of this element in array.
> + 				 */
> + 				for (i = 0; i < occurence_index; i++)
> + 				{
> + 					if (occurences[i] == item)
> + 						break;
> + 				}

This wasn't what I had in mind when I suggested the different approach last
time.  See how I changed it in this version, and let me know if you see any
essential disadvantages.

> + 		/* Update frequency of particular array distinct element count. */
> + 		count_item = (DistinctElementCountItem *) hash_search(count_tab,
> + 															&occurence_index,
> + 											  HASH_ENTER, &count_item_found);
> + 		if (count_item_found)
> + 			count_item->frequency++;
> + 		else
> + 		{
> + 			count_item->count = occurence_index;

The key gets initialized automatically, so I removed this line.

> + 			count_item->frequency = 1;
> + 		}
> + 		total_distinct_count += occurence_index;

total_distinct_count seemed to follow element_no exactly, so I removed it.

> *** a/src/include/catalog/pg_statistic.h
> --- b/src/include/catalog/pg_statistic.h

> ***************
> *** 260,263 **** typedef FormData_pg_statistic *Form_pg_statistic;
> --- 268,285 ----
>    */
>   #define STATISTIC_KIND_MCELEM  4
>   
> + /*
> +  * A "length histogram" slot describes the distribution of lengths of data for
> +  * datatypes where length is important for selectivity estimation. stavalues
> +  * contains M (>=2) non-null values that divide the non-null column data values
> +  * into M-1 bins of approximately equal population. The first stavalues item
> +  * is the minimum length and the last is the maximum length. In dependence on
> +  * datatype this slot can hold distribution of not exactly length, but of
> +  * similar value. For instance, it hold distribution of distinct elements count
> +  * for arrays, because multiple occurences of array elements are ignored by
> +  * array comparison operators. 
> +  *
> +  */
> + #define STATISTIC_KIND_LENGTH_HISTOGRAM  5

I changed this text to say that we always store distinct element counts.  We
can always update the comment later if we diversify its applications.

> *** a/src/include/catalog/pg_type.h
> --- b/src/include/catalog/pg_type.h

This now updates all array types except record[].  I don't know offhand how
to even make a non-empty value of type record[], let alone get it into a
context where ANALYZE would see it.  However, is there a particular reason to
make that one different?

Thanks,
nm
