Re: statistics for array types

From: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: statistics for array types
Date: 2015-09-16 17:01:41
Message-ID: CAPpHfdv8cjgg7kXNoOtAr1Vekn-hmD6bFzT2D84EknGSgf0_LA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Aug 24, 2015 at 8:26 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:

> On Thu, Aug 20, 2015 at 6:00 PM, Tomas Vondra <
> tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
>> Hi,
>>
>> On 08/11/2015 04:38 PM, Jeff Janes wrote:
>>
>>> When reviewing some recent patches, I decided the statistics gathered
>>> for arrays had some pre-existing shortcomings.
>>>
>>> The main one is that when the arrays contain rare elements there is
>>> no histogram to fall back upon when the MCE array is empty, the way
>>> there is for scalar stats. So it has to punt completely and resort
>>> to saying that it is 0.5% selectivity without recourse to any data at
>>> all.
>>>
>>> The rationale for applying the threshold before things are eligible
>>> for inclusion in the MCE array seems to be that this puts some
>>> theoretical bound on the amount of error we are likely to have in
>>> that element. But I think it is better to exceed that theoretical
>>> bound than it is to have no data at all.
>>>
>>> The attached patch forces there to be at least one element in MCE,
>>> keeping the one element with the highest predicted frequency if the
>>> MCE would otherwise be empty. Then any other element queried for is
>>> assumed to be no more common than this most common element.
>>>
>>
>> We only really need the frequency, right? So do we really need to keep
>> the actual MCV element? I.e. most_common_elem_freqs does not have the
>> same number of values as most_common_elems anyway:
>>
>> A list of the frequencies of the most common element values, i.e., the
>> fraction of rows containing at least one instance of the given value.
>> Two or three additional values follow the per-element frequencies;
>> these are the minimum and maximum of the preceding per-element
>> frequencies, and optionally the frequency of null elements.
>> (Null when most_common_elems is.)
>>
>> So we might modify it so that it's always defined - either it tracks the
>> same values as today (when most_common_elems is defined), or the
>> frequency of the most common element (when most_common_elems is NULL).
>>
>
> I had also considered that. It requires more changes to make it happen,
> and it seems to create a more complex contract on what those columns mean,
> but without giving a corresponding benefit.
>
>
>>
>> This way we can keep the current theoretical error-bound on the MCE
>> frequencies, and if that's not possible we can have at least the new
>> value without confusing existing code.
>
>
> But if the frequency of the most common element was grossly wrongly, then
> whatever value we stick in there is still going to be grossly wrong.
> Removing the value associated with it isn't going to stop it from being
> wrong. When we do query with the (incorrectly thought) first most common
> element, either it will find and use the wrong value from slot 1, or it
> will find nothing and fall back on the same wrong value from slot 3.
>

Hmm, I think we should store cutoff_freq / nonnull_cnt as minfreq when we
collect no MCEs. Moreover, I think we should store it even when num_mcelem
>= track_len and we haven't cut MCEs we find. In this case we can get more
precise estimation for rare element using the knowledge that all MCEs which
exceeds the threshold are present (assuming their frequecies could be much
higher than threshold).

When there are no MCEs then we should use assumption that there are no
elements more frequent than cutoff_freq / nonnull_cnt. Using lower values
wouldn't be statistically correct.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2015-09-16 17:12:38 Re: hot_standby_feedback default and docs
Previous Message Alexander Korotkov 2015-09-16 16:47:41 Re: Use pg_rewind when target timeline was switched