Re: Postgresqlism & Vacuum?

From: Jurgen Defurne <defurnj(at)glo(dot)be>
To: postgreSQL general mailing list <pgsql-general(at)postgresql(dot)org>
Subject: Re: Postgresqlism & Vacuum?
Date: 2000-04-17 19:33:44
Message-ID: 38FB6718.BA8185DB@glo.be
Lists: pgsql-general

Stephen J Lombardo wrote:

> Because small is a relative term. You will notice that Bruce does not
> say "where a table is less than 100 tuples" or something like that. And
> because in the end you would probably waste significantly more time than a
> few micro-seconds. Consider a table where you have some round number of
> tuples, say 100,000. Suppose you had b-tree indexes on two attributes,
> employee_id (primary key) and last_name. Now if you were to run a query to
> look up an employee by the primary key you would surely want to use the
> index. Assume that it would take 3 disk accesses to search the index, and
> one to fetch the data page from the heap. So you have a total of 4 disk
> accesses to search on primary key and retrieve one row. Now suppose you were

You have to consider that the B-tree index itself has the property of clustering
its keys. This means, first of all, that with a key like employee_id you can fit
as many as 150 keys in an index block (blocksize 2K, administrative data accounted
for). After the first record is retrieved, the index is walked sequentially, so you
need only one index access per 150 keys. At that rate the whole index consists of
roughly 666 blocks, and to retrieve your 50,000 records you need to read only about
333 index pages. On average, that is one index disk access per 150 records.
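
As a quick back-of-the-envelope sketch of that arithmetic (in Python; the 150 keys
per 2K block and the row counts are just the assumed figures from above, not
measured values):

    # Rough index-scan arithmetic with the assumed figures from the discussion.
    TOTAL_TUPLES   = 100_000   # rows in the example table
    KEYS_PER_BLOCK = 150       # b-tree keys fitting in one 2K index block
    ROWS_WANTED    = 50_000    # rows returned by the big query

    index_blocks = TOTAL_TUPLES // KEYS_PER_BLOCK   # ~ 100,000 / 150, about 666 blocks
    leaf_reads   = ROWS_WANTED  // KEYS_PER_BLOCK   # ~ 50,000 / 150, about 333 blocks

    print(f"index size        : {index_blocks} blocks")
    print(f"index blocks read : {leaf_reads} (about one access per {KEYS_PER_BLOCK} rows)")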

>
> going to run a query that would return a significant number of rows, let's
> say half the table (50,000). Now if the optimizer chose to use the index on
> that query it would take 4 disk accesses to locate each and every row (3 to
> search the index, 1 to grab the data page). So if the query ran using the index it
> would use 200,000 (50,000 * 4) disk accesses (worst case scenario of course.
> Using CLUSTER could improve the efficiency). Let's assume that the average
> size of a tuple is 500k. So PostgreSQL would pack about 16 tuples into a

I think you mean 500 bytes here, right?

>
> single page. Therefore doing a sequential search on the table would require
> 100,000/16, or 6250 disk accesses. Depending on the speed of your drive this
> could make a big difference. Suppose the large query was run only 10 times a
> day, that would waste around 2 million disk accesses. Now if you were using
> a join, performance would suffer even more.

Since your data is also spread over 6250 pages, and you want to retrieve 50,000 of
the rows, there are two cases:
a) Complete randomness
b) Complete sequentialness
In the case of complete randomness you might, in the worst case, need one page read
for every row fetched (in which case you do have 50,000 accesses), but this is
nonsense, because every file/DBMS system since COBOL implements some buffering
scheme, so the chance of finding your data in a block that is already in memory
grows.
In the case of complete sequentialness, you only need to read 3125 blocks.
Also, most systems since COBOL implement some read-ahead, so disk access
drops again.
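
To put rough numbers on those two cases (in Python; the 6250 pages and the 50% hit
rate are the assumed figures from the example, not measurements):

    # Rough comparison of the two access patterns described above.
    TOTAL_PAGES = 6250      # 100,000 tuples at about 16 tuples per page
    ROWS_WANTED = 50_000    # half the table

    random_no_cache  = ROWS_WANTED                    # worst case: one page read per row
    random_buffered  = min(ROWS_WANTED, TOTAL_PAGES)  # with buffering, each page read at most once
    fully_sequential = TOTAL_PAGES // 2               # wanted rows packed into half the pages

    print(f"random, no buffering : {random_no_cache} page reads")
    print(f"random, buffered     : at most {random_buffered} page reads")
    print(f"fully sequential     : {fully_sequential} page reads")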

>
> The job of the optimizer is to make educated decisions about how to run
> a query. Stats will help it out significantly, but it is expensive to
> maintain statistics on a running database and it would decrease overall
> performance. Instead the answer is to collect statistics periodically. There
> is reasoning behind this too. Consider a table where you have 1,000,000
> tuples. One of the attributes is called state. Currently there are only 5
> states in the database. A query is run like this:
>
> SELECT state FROM table_name WHERE state='NY';
>
> The optimizer will see if it has any statistics on this table. If not it
> will make a guess at how many rows are returned. So the optimizer guesses
> that 1% of the table, or 10,000 rows, will be returned. Then it will use
> that number to assess how to run the query. Now if it had statistics on the
> table the optimizer would know that there were only 5 different values in
> the state column of the table. So the optimizer would assume that 20% of
> the table would be returned from the query. It is likely that the optimizer
> will choose a very different plan when it thinks that 200,000 rows will be
> returned.

Two cases here:
a) state is indexed
b) state is not indexed
If state is indexed, then the optimizer doesn't need to do much work: just find
the first matching index block, then walk the index until state <> 'NY', regardless
of the number of records to be retrieved.
If it is not indexed, then it should not even attempt an index scan, but just
start searching sequentially.
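
To illustrate the size of the estimation gap Stephen describes, here is a toy
sketch (in Python; the 1% default guess and the even split over 5 values are taken
from his example, not from PostgreSQL's actual cost model):

    # Toy illustration of how statistics change the planner's row estimate.
    TOTAL_ROWS      = 1_000_000
    DEFAULT_GUESS   = 0.01      # without stats: assume about 1% of rows match
    DISTINCT_STATES = 5         # with stats: 5 distinct values -> assume an even split

    rows_without_stats = int(TOTAL_ROWS * DEFAULT_GUESS)   # 10,000 rows
    rows_with_stats    = TOTAL_ROWS // DISTINCT_STATES     # 200,000 rows

    print(f"estimate without statistics: {rows_without_stats} rows")
    print(f"estimate with statistics   : {rows_with_stats} rows")

Whether 200,000 expected rows still favours an index walk or a sequential scan is
exactly the kind of decision those statistics are meant to inform.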

>
> You can be confident that the fine PostgreSQL developers have done a
> good job with the optimizer. There are reasons that things are done the way
> they are, but they might not be immediately apparent.
>
> Cheers,
> Stephen

The main thing the optimizer does is decide through which indices, if any, a query
is best executed. The statistics from VACUUM ANALYZE can help, but I think that a
good insight into the when and where of indices helps performance more than adding
indices ad infinitum, running an ANALYZE, and then hoping that the system will run
fast.

Jurgen
defurnj(at)glo(dot)be
