Re: Cross-column statistics revisited

From: Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>
To: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
Cc: josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cross-column statistics revisited
Date: 2008-10-17 00:00:20
Message-ID: BBCB1CB3-976C-4572-B1D8-E0F4D0E3345C@enterprisedb.com
Lists: pgsql-hackers

This is yet another issue entirely. This is about estimating how much
I/O will be random I/O if we do an index-order scan. Correlation is a
passable tool for this, but we might be able to do better.

But it has nothing to do with the cross-column stats problem.

greg

On 17 Oct 2008, at 01:29 AM, Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
wrote:

> Josh Berkus wrote:
>>> Yes, or to phrase that another way: What kinds of queries are being
>>> poorly optimized now and why?
>> Well, we have two different correlation problems. One is the
>> problem of dependent correlation, such as the 1.0 correlation of
>> ZIP and CITY fields as a common problem. This could in fact be
>> fixed, I believe, via a linear math calculation based on the
>> sampled level of correlation, assuming we have enough samples. And
>> it's really only an issue if the correlation is > 0.5.
>
> I'd note that this can be an issue even without 2 columns involved.
>
> I've seen a number of tables where the data is loaded in batches
> so similar-values from a batch tend to be packed into relatively few
> pages.
>
> Think of a database for a retailer that nightly aggregates data from
> each of many stores. Each incoming batch inserts the store's data
> into tightly packed disk pages where almost all rows on the page are
> for that store. But those pages are interspersed with pages from
> other stores.
>
> I think I like the ideas Greg Stark had a couple years ago:
> http://archives.postgresql.org/pgsql-hackers/2006-09/msg01040.php
> "...sort the sampled values by value
> and count up the average number of distinct blocks per value.... Or
> perhaps we need a second histogram where the quantities are of
> distinct pages rather than total records.... We might also need a
> separate "average number of n-block spans per value"
> since those seem to me to lead more directly to values like "blocks
> that need to be read".
>
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
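The "average number of distinct blocks per value" idea quoted above can be
sketched roughly as follows. This is a minimal illustration, not PostgreSQL's
actual ANALYZE code: it assumes we already have a sample of (value, block
number) pairs, and the store names and block numbers are made up.

```python
# Hypothetical sketch of the quoted idea: sort/group sampled values and
# count the average number of distinct blocks each value appears on.
# A batch-loaded (well-clustered) value should touch few distinct blocks.
from collections import defaultdict

def avg_distinct_blocks_per_value(sample):
    """sample: iterable of (value, block_number) pairs drawn from a table."""
    blocks_by_value = defaultdict(set)
    for value, block in sample:
        blocks_by_value[value].add(block)   # set keeps only distinct blocks
    if not blocks_by_value:
        return 0.0
    return sum(len(b) for b in blocks_by_value.values()) / len(blocks_by_value)

# Two stores, each packed into two adjacent pages (made-up numbers):
sample = [("store_a", 1), ("store_a", 1), ("store_a", 2),
          ("store_b", 7), ("store_b", 7), ("store_b", 8)]
print(avg_distinct_blocks_per_value(sample))  # 2.0: each value spans 2 blocks
```

A low average relative to the rows-per-value count suggests the value's rows
are physically clustered even when the column's overall correlation with
physical order is poor, which is exactly the batch-loading case described
above.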
