Re: [HACKERS] Optimizer fails?

From: dg(at)illustra(dot)com (David Gould)
To: mimo(at)interdata(dot)com(dot)pl (Michal Mosiewicz)
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] Optimizer fails?
Date: 1998-03-27 19:43:06
Message-ID: 9803271943.AA27390@hawk.illustra.com
Lists: pgsql-hackers

Michal Mosiewicz writes:
>
> Say you have a table with 1024 indexed values from 0 to 1023. Now, you
> want to select the values less than, say, 64. If you use sequential
> scan, you have to scan 1024 records, right? But if you use index scan,
> you first go to the first node of the tree, you compare 64 with the
> value stored in this node, which (for a balanced tree) is 512. Now you
> have to search through the next lower node. You compare 64 to 128, which is
> the value of this node. Then you go to the next lower node. You compare 64
> to 64, and yes, we hit the target. After 3 searches, we know that
> everything under this node is the set of records we are looking for
> right? So, what we have to do is to browse through those values, and
> collect the tuple identifiers.
>
> Note that once we have done these 3 searches, we have to browse all the
> structure of the tree below. We are looking for 64 values. So, it will
> cost us looking through 128 nodes of the subtree.
>
> OK, so by using an index, we had to check 128 + 3 nodes of the tree.

Our btree indexes are quite a bit better than the balanced tree you suppose.

> Now, let's note that there have been only a few IO transfers by now. No
> more than a few pages. And we have tuple identifiers pointing us to 64
> records. Now we may sort these tids in ascending order to optimise IO.

But, we do not do this tid sort. It isn't as easy as it sounds: you might have
millions of tids, not just a few, which would mean doing an external sort.
This might be a nice thing to do, but it isn't there now as far as I know.
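
(If the tid list did fit in memory the sort itself would be trivial; a sketch
along these lines, using a made-up tid struct rather than the real on-disk
format, is all it would take.  The problem is purely that the list can be
huge.)

    #include <stdio.h>
    #include <stdlib.h>

    /* Made-up tid layout, for illustration only: block number plus
     * offset within the block.  Not the actual on-disk format. */
    typedef struct
    {
        unsigned long  block;
        unsigned short offset;
    } tid_t;

    /* Order by block, then offset, so the heap is fetched front to back. */
    static int
    tid_cmp(const void *a, const void *b)
    {
        const tid_t *ta = (const tid_t *) a;
        const tid_t *tb = (const tid_t *) b;

        if (ta->block != tb->block)
            return (ta->block < tb->block) ? -1 : 1;
        return (int) ta->offset - (int) tb->offset;
    }

    int
    main(void)
    {
        tid_t tids[] = {{17, 3}, {2, 9}, {17, 1}, {5, 4}, {2, 1}};
        int   n = sizeof(tids) / sizeof(tids[0]);
        int   i;

        /* Fine for a handful of tids; with millions it becomes an
         * external (on-disk) sort, which is the part we don't have. */
        qsort(tids, n, sizeof(tid_t), tid_cmp);

        for (i = 0; i < n; i++)
            printf("(%lu,%u)\n", tids[i].block, (unsigned int) tids[i].offset);
        return 0;
    }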

> Everything took us 3 + 128 nodes from index + 64 records from table.
> This is definitely better than reading all 1024 records.

I am not trying to flame you, but it seems to me that you have some
misconceptions about how the indexes work, and I am only trying to explain
them a little better.

Using your example of 1024 rows with values from 0 to 1023. Further assuming:

- 8K pages
- 200 byte data rows (including overheads)
- 4 byte keys, so perhaps 32 byte index rows with overheads
- btree index

Then:

Table has 8192 / 200 = 40 rows per page giving 1024 / 40 = 26 pages

Index has 8192 / 32 = 256 keys per page giving 1024 / 256 = 4 leaf pages
and one root page with 4 keys.
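
The same arithmetic as a throwaway C snippet, in case you want to plug in
different guesses (the 8K page, 200 byte row and 32 byte index entry are the
assumptions above, nothing measured):

    #include <stdio.h>

    int
    main(void)
    {
        int page_size = 8192;       /* 8K pages                        */
        int row_size = 200;         /* data row including overheads    */
        int key_size = 32;          /* index row including overheads   */
        int nrows = 1024;

        int rows_per_page = page_size / row_size;                       /* 40  */
        int data_pages = (nrows + rows_per_page - 1) / rows_per_page;   /* 26  */
        int keys_per_page = page_size / key_size;                       /* 256 */
        int leaf_pages = (nrows + keys_per_page - 1) / keys_per_page;   /* 4   */

        printf("%d rows/page -> %d data pages, %d keys/page -> %d leaf pages\n",
               rows_per_page, data_pages, keys_per_page, leaf_pages);
        return 0;
    }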

Something like:
                      ____________________
          index root: | 0, 256, 512, 768 |
                      --------------------
                        /         \      \
                      /            \      \
                    /               \      \--------------\
                  /                  \                     \
       _______________         _________________  ____________
leaf0: | 0,1,2...255 |  leaf1: | 256,257...511 |  leaf2: ....
       ---------------         -----------------  ------------
        |   \      \--------------------------------\
        |    \------------------\                    \
        |                        \                    \
        |                         \                    \
data:  | 234 |, | 11 |, | 763 |, | 401 |, | 29 |, | 970 |, | 55 |, ....

Scanning the index to get the tids for keys 0...63 will take two page
reads: the root page and leaf0.

But, to access the data rows you still need 64 random IOs.

So the total times to read the data rows for keys 0..63 look like:

Using index:

IOs     time        why

  1       20msec    read root
  1       20msec    read leaf0
 64     1280msec    read 64 rows from data pages
---     ---------
 66     1320msec    total

Using table scan:

IOs     time        why

  1       20msec    seek and read 1st page
 25      125msec    sequential read (5 msec each) of rest of table
---     --------
 26      145msec    total

Note that this ignores cache effects.
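
If you want to play with the seek and transfer guesses, the whole cold-cache
comparison above is just (the 20 msec random and 5 msec sequential figures
are guesses, nothing measured):

    #include <stdio.h>

    int
    main(void)
    {
        double random_ms = 20.0;    /* seek + read of one page (guess)    */
        double seq_ms = 5.0;        /* read of the next sequential page   */
        int    data_pages = 26;     /* from the layout above              */
        int    nkeys = 64;          /* rows we want (keys 0..63)          */

        /* Index scan, cold cache: root + leaf, then one random IO per row. */
        double index_ms = 2 * random_ms + nkeys * random_ms;

        /* Table scan: one seek, then sequential reads of the rest.         */
        double seqscan_ms = random_ms + (data_pages - 1) * seq_ms;

        printf("index scan %.0f msec, table scan %.0f msec\n",
               index_ms, seqscan_ms);   /* 1320 vs 145 */
        return 0;
    }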

If you assume the cache is big enough for the whole table (and in this case
it is), and assume none of the table is resident initially (cold start), then
the IO analysis is:

Using index (with cache):

IOs     time        why

  1       20msec    read root
  1       20msec    read leaf0
 10      200msec    read 10 unique data pages (assumes desired keys are
                    not uniformly distributed, this favors the index case)
---     ---------
 12      240msec    total

Using table scan (with cache):

IOs     time        why

  1       20msec    seek and read 1st page
 25      125msec    sequential read (5 msec each) of rest of table
---     --------
 26      145msec    total (no difference from uncached case)

Even with the very favorable assumption that only some of the data pages
need to be read, the sequential scan is still faster here.

> And note that this is a very simple example. In my case I had a 250MB
> table, and about 1% of records to select from it.

My initial calculation a few posts ago was that the break-even point was .5%
for the table you described, so I still think it is reasonable to do a table
scan.
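
That earlier calculation isn't reproduced here, but the shape of it is easy
to rerun for your 250MB table.  A deliberately crude sketch, using the same
20 msec / 5 msec guesses and charging one random IO per fetched row:

    #include <stdio.h>

    int
    main(void)
    {
        double random_ms = 20.0, seq_ms = 5.0; /* same guesses as above    */
        long   nrows = 1024;                   /* plug in your table here  */
        int    rows_per_page = 40;
        long   data_pages = (nrows + rows_per_page - 1) / rows_per_page;
        int    index_pages = 2;                /* root + leaf              */

        double seqscan_ms = random_ms + (data_pages - 1) * seq_ms;
        long   rows = 0;

        /* Count how many rows we can fetch through the index (one random
         * IO each) before it costs more than scanning the whole table.   */
        while (index_pages * random_ms + rows * random_ms < seqscan_ms)
            rows++;

        printf("index scan stops winning at about %ld rows (%.2f%% of table)\n",
               rows, 100.0 * rows / nrows);
        return 0;
    }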

-dg

David Gould dg(at)illustra(dot)com 510.628.3783 or 510.305.9468
Informix Software (No, really) 300 Lakeside Drive Oakland, CA 94612
- Linux. Not because it is free. Because it is better.
