From: "Say42" <andrews42(at)yandex(dot)ru>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimizer improvements: to do or not to do?
Date: 2006-09-12 10:48:36
Message-ID: 1158058116.453148.224620@i42g2000cwa.googlegroups.com
Lists: pgsql-hackers

Simon Riggs wrote:
> This type of work is 90% analysis, 10% coding. You'll need to do a lot
> of investigation, lots of discussion and listening.

I absolutely agree with you, and I am not about to rush into coding
right now. First of all I am going to dig through the PG sources,
READMEs and so on. It's a good school of coding and of understanding
DBMS internals.

> > That's what I want to do:
> > 1. Replace not very useful indexCorrelation with indexClustering.
>
> An opinion such as "not very useful" isn't considered sufficient
> explanation or justification for a change around here.

Sometimes indexCorrelation is even misleading. There are many examples
of overestimated index scan cost (data that are well clustered but not
ordered, so correlation is low), and some cases of underestimation
where the tuples look well ordered, with a high degree of correlation,
yet the index scan actually causes random page fetches (pages fetched
in the order 1-3-2-4-6-5, for example; on a server without RAID this is
VERY slow, 25 times slower than a bitmap index scan). If we had a
dedicated clustering measure, we could estimate the page count more
precisely.
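
To make the gap concrete, here is a toy C program that computes both
measures for the 1-3-2-4-6-5 fetch order above. The correlation is just
Pearson's r between index order and heap page number; the 'clustering'
definition (the fraction of fetches that land on the same or the
immediately following heap page) is only my illustration, nothing that
exists in the planner:

    #include <stdio.h>
    #include <math.h>

    /* Pearson correlation between index order (0,1,2,...) and heap page. */
    static double
    index_correlation(const int *pages, int n)
    {
        double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
        int i;

        for (i = 0; i < n; i++)
        {
            sx += i;
            sy += pages[i];
            sxx += (double) i * i;
            syy += (double) pages[i] * pages[i];
            sxy += (double) i * pages[i];
        }
        return (sxy - sx * sy / n) /
            sqrt((sxx - sx * sx / n) * (syy - sy * sy / n));
    }

    /* Hypothetical clustering: fraction of fetches hitting the same or
     * the next heap page, i.e. fetches that read-ahead makes cheap. */
    static double
    index_clustering(const int *pages, int n)
    {
        int i, cheap = 0;

        for (i = 1; i < n; i++)
            if (pages[i] == pages[i - 1] || pages[i] == pages[i - 1] + 1)
                cheap++;
        return (double) cheap / (n - 1);
    }

    int
    main(void)
    {
        int pages[] = {1, 3, 2, 4, 6, 5};

        printf("correlation = %.3f\n", index_correlation(pages, 6)); /* 0.886 */
        printf("clustering  = %.3f\n", index_clustering(pages, 6));  /* 0.000 */
        return 0;
    }

Correlation comes out at 0.886, so the current model prices the scan as
nearly sequential, yet not a single fetch in the sequence is actually
sequential.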
The next step could be to introduce 'ordering' as a measure of how
sequentially pages are accessed. Without the 'ordering', all we can
assume is that pages are fetched in random order. Anyhow, if index
access cost is overestimated, we can set random_page_cost = 2. (Is that
true in a production database with a smart RAID?)
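
As a sketch of what costing with such a statistic might look like, here
is a toy model in C; the names and the interpolation formula are mine
alone, not anything in costsize.c:

    #include <stdio.h>

    /* Toy cost model, assuming an 'ordering' statistic in [0,1]: the
     * fraction of page fetches that arrive in sequential order. Names
     * and formula are my own illustration, not planner code. */
    static double
    page_fetch_cost(double pages, double ordering,
                    double seq_cost, double random_cost)
    {
        return pages * (ordering * seq_cost + (1.0 - ordering) * random_cost);
    }

    int
    main(void)
    {
        /* 1000 pages, sequential fetch cost 1.0, random_page_cost 4.0 */
        printf("random order: %.0f\n", page_fetch_cost(1000, 0.0, 1.0, 4.0));
        printf("90%% ordered:  %.0f\n", page_fetch_cost(1000, 0.9, 1.0, 4.0));
        return 0;
    }

With 1000 pages this prices a fully random scan at 4000 and a
90%-ordered one at 1300, per scan, instead of one global knob that
shifts every query at once.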
Moreover, I think the problem is more complex. Under the assumption
that index access is always random, we run into another problem:
overestimation of index scans on the master table. If it is small
enough, PG can choose a seq scan instead of an index scan even when the
latter is actually much cheaper because of caching. That is why caching
should be taken into account when calculating join costs.
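
A rough sketch of the kind of model I have in mind, with a deliberately
naive cache assumption (whatever fraction of the inner relation fits in
cache stays resident across rescans; all names and the formula are
illustrative only, not the planner's code):

    #include <stdio.h>

    /* Hypothetical cache-aware cost of repeatedly scanning the inner
     * relation of a nested loop: the part of it that fits in cache is
     * assumed to stay resident across rescans. */
    static double
    nestloop_inner_cost(double outer_rows, double inner_pages,
                        double cache_pages, double random_page_cost)
    {
        double miss_frac = (inner_pages > cache_pages)
            ? (inner_pages - cache_pages) / inner_pages
            : 0.0;

        /* First scan reads every page; later rescans pay only for the
         * part that fell out of cache. */
        return inner_pages * random_page_cost
            + (outer_rows - 1.0) * inner_pages * miss_frac * random_page_cost;
    }

    int
    main(void)
    {
        /* 100-page inner table probed 10000 times, 1000 pages of cache:
         * rescans of a fully cached table are essentially free ... */
        printf("cached:   %.0f\n", nestloop_inner_cost(10000, 100, 1000, 4.0));
        /* ... whereas a cache-blind model charges the full price each time. */
        printf("no cache: %.0f\n", nestloop_inner_cost(10000, 100, 0, 4.0));
        return 0;
    }

For a 100-page inner table probed 10000 times this gives 400 when the
table fits in cache versus 4000000 for a cache-blind estimate, which is
roughly the discrepancy I am describing.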

> > 2. Consider caching of inner table in a nested loops join during
> > estimation total cost of the join.

> I'd work on one thing at a time and go into it deeply.

Good news. So I'm very interested in what you think about my ideas.
Are they wrong, or too naive?
