## WIP: collect frequency statistics for arrays

**Attachment: arrayanalyze-0.1.patch.gz**

Description: application/x-gzip (9.8 KB)
### Responses

### pgsql-hackers by date

A WIP patch for statistics collection on arrays is attached. It largely copies the statistics collection for tsvector, with the following differences:

1. The default comparison, hash, and equality functions for the element data type are used (taken from the corresponding default operator classes).
2. The @> and && operators ignore how many times an element occurs in an array, e.g. '{1}'::int[] @> '{1,1}'::int[] is true, and so on. That is why the statistics collection and selectivity estimation functions count each distinct element only once per array.
3. array_typanalyze stores the frequency of null elements as a separate value (like the maximum and minimum frequencies in ts_typanalyze). It is currently not used in selectivity estimation, but it may be useful in the future.

I have also run into the following problems:

1. Selectivity estimation for the ANY and ALL keywords does not seem as easy as for operators, because their selectivity is estimated inside the planner. So the planner would have to be modified to do selectivity estimation for these keywords. Probably I'm missing something.
2. I didn't implement selectivity estimation for the "column <@ const" and "column = const" cases. The problem with "column <@ const" is that we need to estimate the frequency of occurrence of any element not in const. We could try to collect a statistic for the combined frequency of all elements that are not among the most common elements, assuming they occur independently, but I'm not sure this statistic would be precise enough. The "column = const" case has an additional problem: the @> and && operators ignore element occurrence count and order, while the = operator requires an exact match. So the statistics for @> and && can be applied to = only very approximately.
3. I need to test selectivity estimation for arrays, but it is hard to tell which distributions are typical for arrays. For example, we know that data in tsvector is based on natural language, so we can assume something about its distribution. We know almost nothing about data in arrays, because an array can hold any data (a tsvector can also hold any data, but using it for non-natural-language data is outside its intended purpose).

------

With best regards,
Alexander Korotkov.
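The uniqueness counting described in the message (each distinct element counted once per array, because @> and && ignore repeats) can be illustrated with a small Python sketch. This is not the patch's C code: the function names, the treatment of the null frequency as "fraction of arrays containing at least one null", and the independence assumption used to combine per-element frequencies are all assumptions made for illustration only.

```python
from collections import Counter


def element_frequencies(arrays):
    """Return (freqs, null_frac) for a sample of arrays.

    freqs maps each distinct non-null element to the fraction of arrays
    that contain it at least once. Repeats inside one array are ignored,
    mirroring the fact that @> and && do not look at occurrence counts.
    null_frac is the fraction of arrays with at least one null element
    (an assumed definition, kept as a separate value).
    """
    n = len(arrays)
    counts = Counter()
    arrays_with_null = 0
    for arr in arrays:
        # Deduplicate within the array before counting.
        counts.update({x for x in arr if x is not None})
        if any(x is None for x in arr):
            arrays_with_null += 1
    freqs = {elem: c / n for elem, c in counts.items()}
    return freqs, arrays_with_null / n


def contains_selectivity(freqs, const, default=0.0):
    """Estimate 'column @> const', assuming elements occur independently."""
    sel = 1.0
    for elem in set(const):  # repeats in const do not matter either
        sel *= freqs.get(elem, default)
    return sel


def overlap_selectivity(freqs, const, default=0.0):
    """Estimate 'column && const', assuming elements occur independently."""
    none_present = 1.0
    for elem in set(const):
        none_present *= 1.0 - freqs.get(elem, default)
    return 1.0 - none_present


sample = [[1, 1, 2], [1], [2, 3, None], [4]]
freqs, null_frac = element_frequencies(sample)
print(freqs, null_frac)
print(contains_selectivity(freqs, [1, 1]))
print(overlap_selectivity(freqs, [1, 2]))
```

In this sample, element 1 appears in two of the four arrays, so its frequency is 0.5 even though it occurs three times in total, and the estimate for `column @> '{1,1}'` equals the estimate for `column @> '{1}'`.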


- Re: WIP: collect frequency statistics for arrays at 2011-02-25 00:08:01 from Robert Haas
- Re: WIP: collect frequency statistics for arrays at 2011-05-23 17:54:14 from Alexander Korotkov

- Next: From: Peter Geoghegan, Date: 2011-02-23 15:09:14, Subject: Re: Correctly producing array literals for prepared statements
- Previous: From: PostgreSQL - Hans-Jürgen Schönig, Date: 2011-02-23 14:56:59, Subject: Re: WIP: cross column correlation ...