Re: Collect frequency statistics for arrays

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Nathan Boley <npboley(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Collect frequency statistics for arrays
Date: 2012-03-05 14:02:24
Message-ID: CAPpHfdtwyn7QFy-1mo+fHHof=Bf8BTdbOO84a_aH9gGLXTmb0w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Mar 5, 2012 at 1:11 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> BTW, one other thing about the count histogram: seems like we are
> frequently generating uselessly large ones. For instance, do ANALYZE
> in the regression database and then run
>
> select tablename,attname,elem_count_histogram from pg_stats
> where elem_count_histogram is not null;
>
> You get lots of entries that look like this:
>
> pg_proc | proallargtypes |
> {1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,6,6,6,2.80556}
> pg_proc | proargmodes |
> {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1.61111}
> pg_proc | proargnames |
> {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,6,6,6,7,7,7,7,8,8,8,14,14,15,16,3.8806}
> pg_proc | proconfig |
> {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1}
> pg_class | reloptions |
> {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1}
>
> which seems to me to be a rather useless expenditure of space.
> Couldn't we reduce the histogram size when there aren't many
> different counts?
>
> It seems fairly obvious to me that we could bound the histogram
> size with (max count - min count + 1), but maybe something even
> tighter would work; or maybe I'm missing something and this would
> sacrifice accuracy.
>

True. If (max count - min count + 1) is small, enumerating of frequencies
is both more compact and more precise representation. Simultaneously,
if (max count - min count + 1) is large, we can run out of
statistics_target with such representation. We can use same representation
of count distribution as for scalar column value: MCV and HISTOGRAM, but it
would require additional statkind and statistics slot. Probably, you've
better ideas?

------
With best regards,
Alexander Korotkov.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kevin Grittner 2012-03-05 14:34:37 Re: autovacuum locks
Previous Message Pavel Stehule 2012-03-05 13:49:01 Re: review: CHECK FUNCTION statement