Re: Cross-column statistics revisited

From: Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>
To: josh(at)agliodbs(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cross-column statistics revisited
Date: 2008-10-16 22:20:58
Message-ID: FD242D4C-AAB8-4F62-8D61-3DB904A348F1@enterprisedb.com
Lists: pgsql-hackers

Correlation is the wrong tool. In fact, zip codes and cities have
nearly zero correlation: zip codes near 00000 are no more likely to be
in cities starting with A than in cities starting with Z.
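
(A quick illustration, with made-up data, of why a Pearson-style
statistic misses this: even when zip fully determines city, the
numeric correlation between the two columns is essentially zero,
because the relationship is a functional dependency, not a linear one.
The toy encoding of cities as integers below is purely hypothetical.)

import random

random.seed(1)

# Hypothetical data: each zip code maps to exactly one "city id",
# so city is perfectly determined by zip.
zips = [random.randint(0, 99999) for _ in range(10000)]
city_of = {z: random.randint(0, 499) for z in set(zips)}
cities = [city_of[z] for z in zips]

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
    return cov / (sx * sy)

# Correlation comes out ~0 even though the dependency is total:
# knowing the zip tells you the city, but not in any linear sense.
print(pearson(zips, cities))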

Even with an appropriate tool, it's not clear to me what to do with
the information. Consider a query with WHERE city='boston' AND
zip='02139' and another with WHERE city='boston' AND zip='90210'. One
will produce many more records than the separate histograms would
predict, and the other will produce zero. How do you determine which
category a given pair of constants falls into?
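
(To make that concrete: a minimal sketch with invented numbers, not
PostgreSQL code, showing that under the independence assumption both
constant pairs get exactly the same row estimate, while the real
answers are 5000 and 0.)

from collections import Counter

# Hypothetical table contents; only the shape of the problem matters.
rows = ([("boston", "02139")] * 5000 + [("boston", "02140")] * 5000
        + [("beverly hills", "90210")] * 5000
        + [("austin", "78701")] * 5000)
n = len(rows)

city_freq = Counter(c for c, _ in rows)
zip_freq = Counter(z for _, z in rows)
pair_freq = Counter(rows)

def independent_estimate(city, zipcode):
    # Per-column selectivities multiplied together, as a planner
    # would do if it treated the columns as independent.
    return (city_freq[city] / n) * (zip_freq[zipcode] / n) * n

for key in [("boston", "02139"), ("boston", "90210")]:
    print(key, "estimate:", independent_estimate(*key),
          "actual:", pair_freq[key])

# Both pairs are estimated at 2500 rows; the consistent pair actually
# returns 5000 and the contradictory pair returns 0.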

Separately, you mention cross-table stats, but that's a whole other
can of worms. I'm not sure which is easier, but let's do one at a
time?

greg

On 17 Oct 2008, at 12:12 AM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:

>
>> Yes, or to phrase that another way: What kinds of queries are being
>> poorly optimized now and why?
>
> Well, we have two different correlation problems. One is the problem
> of dependent correlation, such as the 1.0 correlation of ZIP and CITY
> fields, which is a common problem. This could in fact be fixed, I
> believe, via a linear math calculation based on the sampled level of
> correlation, assuming we have enough samples. And it's really only an
> issue if the correlation is > 0.5.
>
> The second type of correlation issue we have is correlating values in
> a parent table with *rows* in a child table (i.e. FK joins).
> Currently, the planner assumes that all rows in the child table are
> evenly distributed against keys in the parent table. But many
> real-world databases have this kind of problem:
>
> A          B
> 1          10000 rows
> 2          10000 rows
> 3          1000 rows
> 4 .. 1000  0 to 1 rows
>
> For queries which cover values between 4..1000 on A, the misestimate
> won't be much of a real execution problem. But for values 1, 2, 3,
> the query will bomb.
>
>> The other half of this is that bad selectivity estimates only matter
>> if they're bad enough to change the plan, and I'm not sure whether
>> cases like this are actually a problem in practice.
>
> My experience is that any estimate which is more than 5x wrong (i.e.
> < 0.2 or > 5.0) usually causes problems, and 3x sometimes causes
> problems.
>
> --
> --Josh
>
> Josh Berkus
> PostgreSQL
> San Francisco
>
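
As an aside on the quoted A/B example: a rough sketch, with numbers
invented to match Josh's table (taking 1 row each for the 0-to-1
tail), of how far a uniform-distribution assumption is off for the
skewed keys. This ignores MCV-based refinements and is only meant to
show the shape of the problem.

# Hypothetical child-table distribution shaped like the quoted table.
child_rows_per_key = {1: 10000, 2: 10000, 3: 1000}
child_rows_per_key.update({k: 1 for k in range(4, 1001)})

total_child_rows = sum(child_rows_per_key.values())
n_distinct_keys = len(child_rows_per_key)

# Uniform assumption: every key matches the same number of child rows.
uniform_estimate = total_child_rows / n_distinct_keys

for key in (1, 3, 500):
    print(key, "estimate:", round(uniform_estimate),
          "actual:", child_rows_per_key[key])

# Tail keys (4..1000) are small either way, so the plan rarely
# changes; keys 1-3 are underestimated by factors of roughly 45x to
# 450x, which is where the plan goes wrong.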
