From: Stephen J Lombardo <lombardo(at)mac(dot)com>
To: <pgsql-general(at)postgresql(dot)org>
Subject: Re: Postgresqlism & Vacuum?
Date: 2000-04-14 15:58:48
Message-ID: B51CB878.1F45%lombardo@mac.com
Lists: pgsql-general


> I think there must be something wrong with the optimiser that it's
> "postgresqlism" that you must vacuum analyze frequently. Just as an example,
> for Clipper (dBase compiler), it's Clipperism that you must re-index if you
> cannot locate some records just because the indexing module screws up.
>
> For large 24x7 installations, it's impossible to vacuum nightly because when
> postgresql is vacuuming the table is locked up, to the end-user the database
> has already hung.
>
> There has been effort to speed up the vacuuming process, but this isn't the
> cure. I believe the fault lies on the optimizer.
>
> For eg, in Bruce Momjian's FAQ 4.9:
>
> PostgreSQL does not automatically maintain statistics. One has to make
> an explicit vacuum call to update the statistics. After statistics are
> updated, the optimizer knows how many rows in the table, and can
> better decide if it should use indices. Note that the optimizer does
> not use indices in cases when the table is small because a sequential
> scan would be faster.
>
> Why save on micro-seconds to use sequential scan when the table is small and
> later 'forgets' that the table is now big because you didn't vacuum analyze?
> Why can't the optimizer just use indexes when they are there and not
> 'optimize' for special cases when the table is small to save micro-seconds?

Because small is a relative term. You will notice that Bruce does not
say "where a table is less than 100 tuples" or something like that. And
because in the end you would probably waste significantly more time than a
few microseconds. Consider a table where you have some round number of
tuples, say 100,000. Suppose you had b-tree indexes on two attributes,
employee_id (primary key) and last_name. Now if you were to run a query to
look up an employee by the primary key you would surely want to use the
index. Assume that it would take 3 disk accesses to search the index, and
one to fetch the data page from the heap. So you have a total of 4 disk
accesses to search on the primary key and retrieve one row. Now suppose you
were going to run a query that would return a significant number of rows,
let's say half the table (50,000). If the optimizer chose to use the index
for that query it would take 4 disk accesses to locate each and every row
(3 to search the index, 1 to grab the data page). So if the query ran using
the index it would use 200,000 (50,000 * 4) disk accesses (worst case, of
course; using CLUSTER could improve the efficiency). Let's assume that the
average size of a tuple is 500 bytes. With an 8K page, PostgreSQL would pack
about 16 tuples into a single page, so a sequential scan of the table would
require 100,000/16, or 6,250, disk accesses. Depending on the speed of your
drive this could make a big difference. If the large query were run only 10
times a day, using the index would waste around 2 million disk accesses.
With a join, performance would suffer even more.
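
As a rough illustration of that trade-off, you can ask PostgreSQL which plan
it would choose with EXPLAIN (the employees table and its columns here are
hypothetical, and the plan you actually see depends on your own data and
statistics):

EXPLAIN SELECT * FROM employees WHERE employee_id = 12345;
-- an Index Scan is expected when the planner thinks only a few rows match

EXPLAIN SELECT * FROM employees WHERE last_name >= 'M';
-- a Seq Scan is expected when it thinks a large fraction of the table matches
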
The job of the optimizer is to make educated decisions about how to run
a query. Statistics will help it out significantly, but it is expensive to
maintain statistics on a running database and doing so would decrease overall
performance. Instead the answer is to collect them periodically. There is
reasoning behind this too. Consider a table with 1,000,000 tuples. One of the
attributes is called state, and currently there are only 5 distinct values of
state in the database. A query is run like this:

SELECT state FROM table_name WHERE state='NY';

The optimizer will see if it has any statistics on this table. If not, it
will make a guess at how many rows will be returned; say it guesses that 1%
of the table, or 10,000 rows, will come back. It will then use that number to
assess how to run the query. If it had statistics on the table, the optimizer
would know that there are only 5 different values in the state column, so it
would assume that 20% of the table, or 200,000 rows, would be returned by the
query. It is likely that the optimizer will choose a very different plan when
it expects 200,000 rows rather than 10,000.
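
Collecting those statistics is just a periodic VACUUM ANALYZE. As a minimal
sketch (table_name is a placeholder; reltuples and relpages in pg_class are
among the figures the planner consults):

VACUUM ANALYZE table_name;
SELECT relname, reltuples, relpages FROM pg_class
    WHERE relname = 'table_name';
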
You can be confident that the fine PostgreSQL developers have done a
good job with the optimizer. There are reasons that things are done the way
they are, but they might not be immediately apparent.

Cheers,
Stephen
