Re: Does "correlation" mislead the optimizer on large

From: Ron Mayer <ron(at)intervideo(dot)com>
To: Stephan Szabo <sszabo(at)megazone23(dot)bigpanda(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Does "correlation" mislead the optimizer on large
Date: 2003-01-24 20:04:12
Message-ID: Pine.LNX.4.44.0301241138070.986-100000@localhost.localdomain
Lists: pgsql-performance


On Fri, 24 Jan 2003, Stephan Szabo wrote:
>
> I think it's a clumping effect.

Yup, I think that's exactly the effect.

A proposal.... (yes, I'm volunteering if people point me in the right
direction)... would be to have a "pluggable" set of analyze functions, so that
a huge database that runs analyze infrequently could choose to have a very
slow analyze that might work better for its data.

I see no reason the different analyze functions would need to be compiled
into the source code ... they could probably exist as PL/pgSQL functions.
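
For instance, one statistic such a function could compute is the fraction
of the table's pages that actually hold a given value. This is just a
sketch: the function name and interface are made up, and I've written it
against a modern PostgreSQL (format() and dollar quoting) for convenience.
It leans on the fact that a ctid's text form, e.g. '(541,3)', happens to
parse as a point, so the page number can be pulled out with a subscript.

    -- Hypothetical "pluggable analyze" helper: for a given value,
    -- measure how many distinct heap pages actually hold matching
    -- rows, as a fraction of the whole table.
    CREATE OR REPLACE FUNCTION clump_fraction(tbl text, col text, val text)
    RETURNS numeric AS $$
    DECLARE
        pages_touched bigint;
        total_pages   bigint;
    BEGIN
        -- Count the distinct heap pages holding rows with col = val,
        -- parsing the page number out of each matching row's ctid.
        EXECUTE format(
            'SELECT count(DISTINCT (ctid::text::point)[0]) FROM %I WHERE %I = %L',
            tbl, col, val)
            INTO pages_touched;

        -- Total pages in the table, per the last VACUUM/ANALYZE.
        SELECT relpages INTO total_pages
        FROM pg_class WHERE relname = tbl;

        -- A well-clumped value touches few pages; a value spread
        -- evenly through the table touches nearly all of them.
        RETURN pages_touched::numeric / NULLIF(total_pages, 0);
    END;
    $$ LANGUAGE plpgsql;

A value that clumps well would come out as a small fraction even when the
correlation stat, as currently computed, looks terrible.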

The one thing compiling it in would help with is letting me know
the exact number of tuples on each individual page, but I guess
reltuples/relpages from pg_class is a good estimate.
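
i.e. something along these lines (guarding against a never-analyzed
table where relpages is still zero):

    SELECT relname, reltuples, relpages,
           reltuples / NULLIF(relpages, 0) AS est_tuples_per_page
    FROM pg_class
    WHERE relname = 'test1';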

> For example, I made a table (ordered) with 20 values of a, 50 values of b
> (each showing up in each a) and 100 values of c (not used, just means 100
> rows for each (a,b) combination. It's got 541 pages it looks like. Analyze
> sets the correlation to about 0.08 on the table and so a query like:
> select * from test1 where b=1; prefers a sequence scan (1791 vs 2231)
> while the index scan actually performs about 5 times better.

That sounds like the same situation I was in. If my logic is right, that
means you had about 184 tuples/page (20*50*100 rows / 541 pages), so for
each "a" the hundred rows with "b=1" fill only about half a page.
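
(For anyone reproducing this, my guess at the setup would be something
like the following. The generate_series approach, column types, and index
name are my assumptions, not Stephan's actual script.)

    -- 20 values of a, 50 of b, 100 of c, loaded in sorted order so
    -- that equal b values clump together within each value of a.
    CREATE TABLE test1 (a int, b int, c int);

    INSERT INTO test1
    SELECT a, b, c
    FROM generate_series(1, 20)  AS a,
         generate_series(1, 50)  AS b,
         generate_series(1, 100) AS c
    ORDER BY a, b, c;

    CREATE INDEX test1_b_idx ON test1 (b);
    ANALYZE test1;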

If 'c' had 200 values instead, I think you'd get an even bigger speedup,
because right now half of each page read for b=1 is "wasted" on b=2 values.

If 'c' had 10000 values, I think the speedup would grow slightly more
still, because you'd have so many b=1 pages next to each other that you'd
also benefit from more sequential disk access.

> I guess the reason is that in general, the index scan *really* is reading
> something on the order of 40 pages rather than the much larger estimate
> (I'd guess something on the order of say 300-400? I'm not sure how to
> find that except by trying to reverse engineer the estimate number),

Or by adding a printf()... I think it'd be in cost_index in costsize.c.
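
Or, without touching the source at all, the ctid trick from the sketch
above might give the real number directly (again assuming the tid text
form parses as a point):

    -- Distinct heap pages holding rows with b = 1: roughly the number
    -- of pages the index scan actually has to visit.
    SELECT count(DISTINCT (ctid::text::point)[0]) AS heap_pages
    FROM test1
    WHERE b = 1;

If your guess is right, that should come out somewhere around 40.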

> because pretty much each value of a will probably have 1 or 2 pages with
> b=1.
>
> I'm not really sure how to measure that, however.

As I said... I'm happy to volunteer and experiment if people point
me in a good direction.

Ron
