Re: order by and index path

From: jwieck(at)debis(dot)com (Jan Wieck)
To: andreas(dot)zeugswetter(at)telecom(dot)at (Andreas Zeugswetter)
Cc: jwieck(at)debis(dot)com, hackers(at)postgreSQL(dot)org
Subject: Re: order by and index path
Date: 1998-10-15 11:35:09
Message-ID: m0zTlgT-000EBRC@orion.SAPserv.Hamburg.dsh.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Andread wrote:
> > (who really selects nearly all rows from a 5M row
> > table?).
>
> Data Warehouse apps
>
> > This will hurt if someone really selects most of the rows and the index
> > scan jumps over the disc.
>
> I think this is a non issue, since if a qual is not restrictive enough,
> the optimizer should choose a seq scan anyway. Doesn' t it do this already ?
> [...]

Right and wrong.

Right - it is the optimizers job to decide if an index should
be used or not. And the decision has to be made based on the
cost.

Wrong - PostgreSQL's query optimizer doesn't do it already.
It assumes that a qualification is always restrictive enough
and chooses an index scan any time if the qualification can
be thrown into the indexqual.

In the following I only discuss the situation where
qualifications can be used in the indexqual.

Calculating the cost of a query is easy. Have N tuples in P
data-pages where the given qualification will match M.
Assuming that the tuples are not in ascending order in the
data pages, the cost fetching one tuple by its TID raises
with P (more seeking necessary). Now you can calculate the
cost of an index scan by C=M/N*P*F where F is some constant
factor to make C comparable to a current seqscan cost value
(I know, it must be smarter, but for this description a
simple calculation is enough).

The only problem is that the optimizer has absolutely no
chance to estimate M (the mystic value as Bruce called it).
In a given qualification

WHERE key > 0 AND key <= 100

it cannot know if this would result in 0 or 100% of the rows.
To estimate that, it needs statistical information about the
key ranges that are present. Assume there would be 11 keys
remembered by the last vacuum run, that break up the whole
present key range of 10000 tuples into 10 chunks and they
read

5 40 70 90 500 600 1000 1100 1400 1500 2000

where 5 is the lowest key at all, 40 is the key of tuple 1000
(in key order), 70 is the key of tuple 2000 and so on. Now
looking at the qualification and this key range information
would tell, that the absolute limit of rows returned by an
index scan would be 3999 (which still could have a key value
of 100). But the qualification

WHERE key >= 50

could return at max 8999 tuples and

WHERE key > 50 AND key < 70

has a maximum of 998 result tuples. This would be the
information required to make the right decision for the case
where all rows selected are wanted.

We do not have this statistical information. So the whole
thing is at this time academic.

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#======================================== jwieck(at)debis(dot)com (Jan Wieck) #

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message D'Arcy J.M. Cain 1998-10-15 11:38:04 Re: [HACKERS] PostgreSQL v6.4 BETA2...
Previous Message Marc G. Fournier 1998-10-15 09:59:35 Re: [HACKERS] postmaster locking issues.