Re: statistics for array types

From: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: statistics for array types
Date: 2015-12-18 13:33:45
Message-ID: CAPpHfduhM3ECbLRwY0vmfuu2NtTEXvQ1L2Mr0W=7et-UZqb72w@mail.gmail.com
Lists: pgsql-hackers

On Wed, Sep 16, 2015 at 8:01 PM, Alexander Korotkov <
a(dot)korotkov(at)postgrespro(dot)ru> wrote:

> On Mon, Aug 24, 2015 at 8:26 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>
>> On Thu, Aug 20, 2015 at 6:00 PM, Tomas Vondra <
>> tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>>> Hi,
>>>
>>> On 08/11/2015 04:38 PM, Jeff Janes wrote:
>>>
>>>> When reviewing some recent patches, I decided the statistics gathered
>>>> for arrays had some pre-existing shortcomings.
>>>>
>>>> The main one is that when the arrays contain rare elements there is
>>>> no histogram to fall back upon when the MCE array is empty, the way
>>>> there is for scalar stats. So it has to punt completely and resort
>>>> to saying that it is 0.5% selectivity without recourse to any data at
>>>> all.
>>>>
>>>> The rationale for applying the threshold before things are eligible
>>>> for inclusion in the MCE array seems to be that this puts some
>>>> theoretical bound on the amount of error we are likely to have in
>>>> that element. But I think it is better to exceed that theoretical
>>>> bound than it is to have no data at all.
>>>>
>>>> The attached patch forces there to be at least one element in MCE,
>>>> keeping the one element with the highest predicted frequency if the
>>>> MCE would otherwise be empty. Then any other element queried for is
>>>> assumed to be no more common than this most common element.
>>>>
>>>
>>> We only really need the frequency, right? So do we really need to keep
>>> the actual MCV element? I.e. most_common_elem_freqs does not have the
>>> same number of values as most_common_elems anyway:
>>>
>>> A list of the frequencies of the most common element values, i.e., the
>>> fraction of rows containing at least one instance of the given value.
>>> Two or three additional values follow the per-element frequencies;
>>> these are the minimum and maximum of the preceding per-element
>>> frequencies, and optionally the frequency of null elements.
>>> (Null when most_common_elems is.)
>>>
>>> So we might modify it so that it's always defined - either it tracks the
>>> same values as today (when most_common_elems is defined), or the
>>> frequency of the most common element (when most_common_elems is NULL).
>>>
>>
>> I had also considered that. It requires more changes to make it happen,
>> and it seems to create a more complex contract on what those columns mean,
>> but without giving a corresponding benefit.
>>
>>
>>>
>>> This way we can keep the current theoretical error-bound on the MCE
>>> frequencies, and if that's not possible we can have at least the new
>>> value without confusing existing code.
>>
>>
>> But if the frequency of the most common element was grossly wrong, then
>> whatever value we stick in there is still going to be grossly wrong.
>> Removing the value associated with it isn't going to stop it from being
>> wrong. When we query with the (incorrectly identified) first most common
>> element, either it will find and use the wrong value from slot 1, or it
>> will find nothing and fall back on the same wrong value from slot 3.
>>
>
> Hmm, I think we should store cutoff_freq / nonnull_cnt as minfreq when we
> collect no MCEs. Moreover, I think we should store it even when num_mcelem
> >= track_len and we haven't cut any of the MCEs we found. In that case we
> can get a more precise estimate for a rare element using the knowledge that
> all MCEs exceeding the threshold are present (assuming their frequencies
> could be much higher than the threshold).
>
> When there are no MCEs, we should use the assumption that there are no
> elements more frequent than cutoff_freq / nonnull_cnt. Using lower values
> wouldn't be statistically correct.
>

The patch implementing my idea above is attached. On your example it gives
the following result.

# explain (analyze) select * from foobar where foo @> '{567}';
                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Seq Scan on foobar  (cost=0.00..2387.00 rows=30 width=61) (actual time=28.691..28.691 rows=0 loops=1)
   Filter: (foo @> '{567}'::integer[])
   Rows Removed by Filter: 100000
 Planning time: 0.044 ms
 Execution time: 28.707 ms
(5 rows)

In this particular example it gives a less accurate estimate. However, I
believe it is a safer estimate in general.
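
For anyone skimming the thread, here is a minimal standalone sketch of the
estimation rule above (made-up names and hard-coded sample numbers, not the
code from the attached patch):

#include <stdio.h>

/*
 * Sketch only: when ANALYZE keeps no most-common-elements, minfreq is assumed
 * to hold cutoff_freq / nonnull_cnt, i.e. the highest frequency an
 * uncollected element could have had.  At estimation time, an element that is
 * not a known MCE is capped at minfreq instead of falling back to a
 * hard-coded default selectivity.
 */
static double
element_freq_estimate(const double *mce_freqs, int n_mce,
                      int match_idx, double minfreq)
{
    if (match_idx >= 0 && match_idx < n_mce)
        return mce_freqs[match_idx];    /* known MCE: use its frequency */

    /*
     * Not a known MCE: anything at or above the cutoff would have been
     * collected, so assume it is no more frequent than minfreq.
     */
    return minfreq;
}

int
main(void)
{
    int     cutoff_freq = 4;        /* hypothetical ANALYZE-time cutoff count */
    int     nonnull_cnt = 10000;    /* non-null sample entries examined */
    double  minfreq = (double) cutoff_freq / nonnull_cnt;

    /* no MCEs collected and the queried element was not found */
    printf("estimated frequency = %g\n",
           element_freq_estimate(NULL, 0, -1, minfreq));
    return 0;
}

The point is only that the fallback is bounded by what ANALYZE actually
observed rather than by a fixed default.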

I've run into difficulties with storing an empty mcelements array:
update_attstats turns an empty array into NULL, and get_attstatsslot throws
an error when trying to read a NULL array. I've changed get_attstatsslot so
that it returns an empty array when it encounters NULL. I'm not sure about
this solution.
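
To make that workaround concrete, here is a toy sketch with hypothetical
types and names (not the real get_attstatsslot from lsyscache.c): a stored
array that has been collapsed to NULL is handed back to the caller as a
zero-length array instead of raising an error.

#include <stddef.h>
#include <stdio.h>

/* Illustrative stand-in for a statistics slot, not a PostgreSQL struct. */
typedef struct StatsSlot
{
    const double *values;
    int           nvalues;
} StatsSlot;

static void
read_stats_slot(const double *stored_values, int stored_nvalues,
                StatsSlot *slot)
{
    if (stored_values == NULL)
    {
        /* empty MCE list stored as NULL: report it as an empty array */
        slot->values = NULL;
        slot->nvalues = 0;
        return;
    }
    slot->values = stored_values;
    slot->nvalues = stored_nvalues;
}

int
main(void)
{
    StatsSlot slot;

    read_stats_slot(NULL, 0, &slot);
    printf("nvalues = %d\n", slot.nvalues);     /* prints 0, no error raised */
    return 0;
}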

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment Content-Type Size
array_typanalyze_0_mce.patch application/octet-stream 6.6 KB
