Re: Collect frequency statistics for arrays

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Nathan Boley <npboley(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Collect frequency statistics for arrays
Date: 2012-01-07 17:36:42
Message-ID: CAPpHfdusvZH6zi1iRp627iEE6V4YD5pU5GSbqtx_995sJg7w4w@mail.gmail.com
Lists: pgsql-hackers

Hi!

Attached is a patch in which most of the issues are fixed.

On Thu, Dec 29, 2011 at 8:35 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> I find distressing the thought of having two copies of the lossy sampling
> code, each implementing the algorithm with different variable names and levels
> of generality. We might someday extend this to hstore, and then we'd have yet
> another copy. Tom commented[1] that ts_typanalyze() and array_typanalyze()
> should remain distinct, and I agree. However, they could call a shared
> counting module. Is that practical? Possible API:
>
> typedef struct LossyCountCtl;
> LossyCountCtl *LossyCountStart(float s,
> float epsilon,
> int2 typlen,
> bool typbyval,
> Oid eqfunc); /* + hash func, a few others */
> void LossyCountAdd(LossyCountCtl *ctl, Datum elem);
> TrackItem **LossyCountGetAll(LossyCountCtl *ctl);
>
> [1] http://archives.postgresql.org/message-id/12406.1298055475@sss.pgh.pa.us

I'm not sure about a shared lossy counting module, because the shared part of
the code would be relatively small. The part of the compute_array_stats
function that takes care of array decompression, distinct occurrence counting,
the distinct element count histogram, packing of the statistics slots etc. is
much larger than the lossy counting algorithm itself. Maybe there are other
opinions in the community?
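
For concreteness, the call pattern inside compute_array_stats() would
presumably look something like the sketch below. It only illustrates the
proposed API from the quote above (LossyCountStart/LossyCountAdd/
LossyCountGetAll do not exist yet, and the thresholds and typentry fields are
placeholders):

    /* Hypothetical usage of the proposed shared lossy counting module. */
    LossyCountCtl *ctl;
    TrackItem    **items;

    /* set up the counter once per analyzed column */
    ctl = LossyCountStart(0.002,                  /* s: minimum support */
                          0.001,                  /* epsilon: error bound */
                          typentry->typlen,
                          typentry->typbyval,
                          typentry->eq_opr_finfo.fn_oid);

    /* ... for every non-null element of every sampled array ... */
    LossyCountAdd(ctl, elem_value);

    /* after the sample scan, collect surviving counters for pg_statistic */
    items = LossyCountGetAll(ctl);

Even with such a module, the decompression, per-array distinct counting,
histogram and slot packing would still stay in compute_array_stats() itself.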

> I think this is an improvement, but some code out there may rely on the
> ability to get stakind = 4 data from the most_common_vals column. We'll need
> to mention this in the release notes as an incompatibility.

I'm not sure I understand the mechanism of release notes. Does that require
anything in the patch itself?

> > + /*
> > + * Let be n independent events with probabilities p. This function calculates
> > + * probabilities of exact k of events occurence for k in [0;m].
> > + * Imagine matrix M of (n + 1) x (m + 1) size. Element M[i,j] denotes
> > + * probability that exact j of first i events occurs. Obviously M[0,0] = 1.
> > + * Each next event increase total number of occured events if it occurs and
> > + * leave last value of that number if it doesn't occur. So, by the law of
> > + * total probability: M[i,j] = M[i - 1, j] * (1 - p[i]) + M[i - 1, j - 1] * p[i]
> > + * for i > 0, j > 0. M[i,0] = M[i - 1, 0] * (1 - p[i]) for i > 0.
> > + * Also there could be some events with low probabilities. Their summary
> > + * probability passed in the rest parameter.
> > + */
> > + static float *
> > + calc_distr(float *p, int n, int m, float rest)
> > + {
>
> > + /* Take care about events with low probabilities. */
> > + if (rest > 0.0f)
> > + {
> > + /*
> > + * The probability of no occurence of events which forms "rest"
> > + * probability have a limit of exp(-rest) when number of events fo to
> > + * infinity. Another simplification is to replace that events with one
> > + * event with (1 - exp(-rest)) probability.
> > + */
> > + rest = 1.0f - exp(-rest);
>
> What is the name of the underlying concept in probability theory?

The closest concept to the calculated distribution is the multinomial
distribution. But it's not exactly the same, because the multinomial
distribution gives the probability of a particular count of each event's
occurrence, not the probability of the total number of occurrences. Actually,
the distribution is calculated just from the assumption that the events are
independent. The closest concept for the rest probability is approximation by
an exponential distribution. It's quite a rough approximation, but I can't
invent anything better with low calculation complexity.
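
To make the recurrence from the quoted comment concrete, here is a minimal
standalone sketch of the dynamic programming step (illustration only; the
calc_distr in the patch uses palloc, handles the rest pseudo-event inside the
same loop, and differs in details):

    #include <stdlib.h>
    #include <math.h>

    /*
     * Sketch: given n independent events with probabilities p[0..n-1],
     * return r[0..m] where r[j] is the probability that exactly j events
     * occur.  "rest" is the summed probability of rare events, folded in
     * as one extra pseudo-event with probability 1 - exp(-rest).
     */
    static float *
    calc_distr_sketch(const float *p, int n, int m, float rest)
    {
        float  *row = calloc(m + 1, sizeof(float));
        int     i, j;

        row[0] = 1.0f;                      /* M[0,0] = 1 */

        for (i = 0; i < n + (rest > 0.0f ? 1 : 0); i++)
        {
            float   pi = (i < n) ? p[i] : (1.0f - expf(-rest));

            /* update in place, walking j downwards, so row[] holds M[i, *] */
            for (j = m; j > 0; j--)
                row[j] = row[j] * (1.0f - pi) + row[j - 1] * pi;
            row[0] *= (1.0f - pi);
        }
        return row;                         /* caller frees */
    }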

> > + /*
> > + * Array selectivity estimation based on most common elements statistics for
> > + * "column <@ const" case. Assumption that element occurences are independent
> > + * means certain distribution of array lengths. Typically real distribution
> > + * of lengths is significantly different from it. For example, if even we
> > + * have set of arrays with 1 integer element in range [0;10] each, element
> > + * occurences are not independent. Because in the case of independence we
>
> Do you refer to a column where '{1,12,46}' and '{13,7}' may appear, but
> '{6,19,4}' cannot appear?

I refer to a column where each array contains exactly one element, i.e. the
only possible values are '{0}', '{1}', '{2}', '{3}', '{4}', '{5}', '{6}',
'{7}', '{8}', '{9}', '{10}'. That is a corner case. But a similar situation
occurs when, for example, the distinct element count is distributed between 1
and 3. That significantly differs from the distribution implied by independent
occurrence.
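
Back-of-the-envelope numbers for that corner case (my own illustration, not
something in the patch): if each of the 11 elements occurred independently
with probability 1/11, the implied length distribution would be
P(0 elements) = (10/11)^11 ~ 0.35, P(exactly 1) = 11 * (1/11) * (10/11)^10 ~ 0.39,
and P(2 or more) ~ 0.26, while in the real column every array has length
exactly 1.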

> > + * have probabilities of length of 0, 1, 2 etc. In the "column @> const"
> > + * and "column && const" cases we usually have "const" with low summary
> > + * frequency of elements (otherwise we have selectivity close to 0 or 1
> > + * correspondingly). That's why effect of dependence related to lengths
> > + * distribution id negligible there. In the "column <@ const" case summary
> > + * frequency of elements is high (otherwise we have selectivity close to 0).
>
> What does the term "summary frequency" denote?

I meant the sum of the frequencies of the "const" array elements.

> > + /*
> > + * Rest is a average length of elements which aren't present in mcelem.
> > + */
> > + rest = avg_length;
>
> You define "rest" here as an array length ...
>
> > +
> > + default_freq = Min(DEFAULT_CONT_SEL, minfreq / 2);
> > +
> > + mcelem_index = 0;
> > +
> > + /*
> > + * mult is the multiplier that presents estimate of probability that each
> > + * mcelem which is not present in constant doesn't occur.
> > + */
> > + mult = 1.0f;
> > +
> > + for (i = 0; i < nitems; i++)
> > + {
> > + bool found = false;
> > +
> > + /* Comparison with previous value in order to guarantee uniquness */
> > + if (i > 0)
> > + {
> > + if (!element_compare(&array_data[i - 1], &array_data[i]))
> > + continue;
> > + }
> > +
> > + /*
> > + * Iterate over mcelem until find mcelem that is greater or equal to
> > + * element of constant. Simultaneously taking care about rest and
> > + * mult. If that mcelem is found then fill corresponding elem_selec.
> > + */
> > + while (mcelem_index < nmcelem)
> > + {
> > + int cmp = element_compare(&mcelem[mcelem_index], &array_data[i]);
> > +
> > + if (cmp < 0)
> > + {
> > + mult *= (1.0f - numbers[mcelem_index]);
> > + rest -= numbers[mcelem_index];
>
> ... But here, you're subtracting a frequency from an array length?
>

Yes, because the average distinct element count is the sum of the frequencies
of the elements. By subtracting the mcelem frequencies from avg_length we get
the summed frequency of the non-mcelem elements.
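
To make that bookkeeping explicit, here is a toy standalone sketch of what the
loop computes (plain doubles instead of Datums, and not the exact patch code):

    /*
     * Toy sketch: walk the sorted mcelem list against the sorted,
     * de-duplicated elements of the constant.  Starting from avg_length
     * (the sum of all element frequencies), subtracting every mcelem
     * frequency leaves "rest" as the summed frequency of non-mcelem
     * elements.  "mult" accumulates the probability that none of the
     * mcelems absent from the constant occur.
     */
    static void
    mcelem_walk_sketch(const double *mcelem, const double *numbers, int nmcelem,
                       const double *const_elems, int nitems,
                       double avg_length, double *mult_out, double *rest_out)
    {
        double  mult = 1.0;
        double  rest = avg_length;
        int     i = 0, j;

        for (j = 0; j < nmcelem; j++)
        {
            /* advance over constant elements smaller than the current mcelem */
            while (i < nitems && const_elems[i] < mcelem[j])
                i++;

            if (!(i < nitems && const_elems[i] == mcelem[j]))
                mult *= (1.0 - numbers[j]);  /* mcelem absent from the constant */

            rest -= numbers[j];              /* every mcelem frequency leaves "rest" */
        }

        *mult_out = mult;
        *rest_out = rest;
    }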

------
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
arrayanalyze-0.9.patch.gz application/x-gzip 23.7 KB
