Re: More tablescanning fun

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: jim(at)nasby(dot)net
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: More tablescanning fun
Date: 2003-04-25 05:23:10
Message-ID: 13549.1051248190@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

"Jim C. Nasby" <jim(at)nasby(dot)net> writes:
> On Thu, Apr 24, 2003 at 07:58:30PM -0400, Tom Lane wrote:
>> There's been some previous debate about the equation used to correct
>> for correlation, which is certainly bogus (I picked it more or less
>> out of the air ;-)). But so far no one has proposed a replacement
>> equation with any better foundation ... take a look in
>> src/backend/optimizer/path/costsize.c if you want to get involved.

> Are you reffering to the PF formula?

The PF formula is good as far as I know, but it assumes an uncorrelated
table order. The debate is how to correct it for nonzero correlation.
Specifically, this bit:

* When the index ordering is exactly correlated with the table ordering
* (just after a CLUSTER, for example), the number of pages fetched should
* be just sT. What's more, these will be sequential fetches, not the
* random fetches that occur in the uncorrelated case. So, depending on
* the extent of correlation, we should estimate the actual I/O cost
* somewhere between s * T * 1.0 and PF * random_cost. We currently
* interpolate linearly between these two endpoints based on the
* correlation squared (XXX is that appropriate?).

I believe the endpoints s*T and PF*random_cost, I think, but the curve
between them is anyone's guess. It's also quite possible that the
correlation stat that we currently compute is inadequate to model what's
going on.

>> No. It's not apparent to me how you could do that without abandoning
>> MVCC, which we're not likely to do.

> Hmm... does MVCC mandate inserts go at the end?

Anywhere that there's free space. The point is that you can't promise
updates will fit on the same page as the original tuple. So whatever
desirable physical ordering you may have started with will surely
degrade over time.

> On the other hand, it might be possible to get the advantages of a
> clustered index without doing a *true* clustered index. The real point
> is to be able to use indexes; I've heard things like 'if you need to
> access more than 10% of a table then using an index would be
> disasterous', and that's not good... that number should really be over
> 50% for most reasonable ratios of fields indexed to fields in table (of
> course field size plays a factor).

If you have to read 50% of a table, you certainly should be doing a
linear scan. There will be hardly any pages you can skip (unless the
table is improbably well clustered), and the extra I/O needed to read
the index will buy you nothing.

regards, tom lane

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Hannu Krosing 2003-04-25 07:01:31 Re: Important speed difference between a query and a
Previous Message Jim C. Nasby 2003-04-25 04:59:24 Re: More tablescanning fun