Re: Collect frequency statistics for arrays

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Nathan Boley <npboley(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Collect frequency statistics for arrays
Date: 2012-03-08 16:30:52
Message-ID: 12427.1331224252@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Noah Misch <noah(at)leadboat(dot)com> writes:
> On Wed, Mar 07, 2012 at 07:51:42PM -0500, Tom Lane wrote:
>> On reflection my idea above is wrong; for example assume that we have a
>> column with 900 arrays of length 1 and 100 arrays of length 2. Going by
>> what I said, we'd reduce the histogram to {1,2}, which might accurately
>> capture the set of lengths present but fails to show that 1 is much more
>> common than 2. However, a histogram {1,1,1,1,1,1,1,1,1,2} (ten entries)
>> would capture the situation perfectly in one-tenth the space that the
>> current logic does.

> Granted. When the next sample finds 899/101 instead, though, the optimization
> vanishes.

No, you missed my next point. That example shows that sometimes a
smaller histogram can represent the situation with zero error, but in
all cases a smaller histogram can represent the situation with perhaps
more error than a larger one. Since we already have a defined error
tolerance, we should try to generate a histogram that is as small as
possible while still not exceeding the error tolerance.

Now, it might be that doing that is computationally impractical, or
too complicated to be reasonable. But it seems to me to be worth
looking into.

> If you want to materially narrow typical statistics, Alexander's
> proposal looks like the way to go. I'd guess array columns always
> having DEC <= default_statistics_target are common enough to make that
> representation the dominant representation, if not the only necessary
> representation.

Well, I don't want to have two representations; I don't think it's worth
the complexity. But certainly we could consider switching to a
different representation if it seems likely to usually be smaller.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2012-03-08 16:42:58 Re: pg_stat_statements and planning time
Previous Message Kevin Grittner 2012-03-08 16:29:56 Re: sortsupport for text