Re: Collect frequency statistics for arrays

From: Noah Misch <noah(at)leadboat(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Nathan Boley <npboley(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Collect frequency statistics for arrays
Date: 2012-01-17 10:33:22
Message-ID: 20120117103322.GA19977@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jan 17, 2012 at 12:04:06PM +0400, Alexander Korotkov wrote:
> Thanks for your fixes to the patch. Them looks correct to me. I did some
> fixes in the patch. The proof of some concepts is still needed. I'm going
> to provide it in a few days.

Your further fixes look good. Could you also answer my question about the
header comment of mcelem_array_contained_selec()?

/*
* Estimate selectivity of "column <@ const" based on most common element
* statistics. Independent element occurrence would imply a particular
* distribution of distinct element counts among matching rows. Real data
* usually falsifies that assumption. For example, in a set of 1-element
* integer arrays having elements in the range [0;10], element occurrences are
* not independent. If they were, a sufficiently-large set would include all
* distinct element counts 0 through 11. We correct for this using the
* histogram of distinct element counts.
*
* In the "column @> const" and "column && const" cases, we usually have
* "const" with low summary frequency of elements (otherwise we have
* selectivity close to 0 or 1 correspondingly). That's why the effect of
* dependence related to distinct element counts distribution is negligible
* there. In the "column <@ const" case, summary frequency of elements is
* high (otherwise we have selectivity close to 0). That's why we should do
* correction due to array distinct element counts distribution.
*/

By "summary frequency of elements", do you mean literally P_0 + P_1 ... + P_N?
If so, I can follow the above argument for "column && const" and "column <@
const", but not for "column @> const". For "column @> const", selectivity
cannot exceed the smallest frequency among const elements. A number of
high-frequency elements will drive up the sum of the frequencies without
changing the true selectivity much at all.

Thanks,
nm

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fujii Masao 2012-01-17 10:38:23 Re: Online base backup from the hot-standby
Previous Message Simon Riggs 2012-01-17 10:31:47 Re: WAL Restore process during recovery