Collection of PostgreSQL Performance Tips

From: "Brubaker, Shane" <Shane_Brubaker(at)affymetrix(dot)com>
To: pgsql-sql(at)postgresql(dot)org
Cc: Transcriptome(at)neomorphic(dot)com, "Wheeler, Raymond" <Ray_Wheeler(at)affymetrix(dot)com>, "Wong, Brant" <Brant_Wong(at)affymetrix(dot)com>
Subject: Collection of PostgreSQL Performance Tips
Date: 2002-04-25 17:25:50
Message-ID: 53386E0C47E7D41194BB0002B325C997747385@NTEX60
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

This is a collection of many performance tips that we've gathered together
at Affymetrix, and I thought it would be useful to post them to the
PostgreSQL news group.

The single most helpful trick has been the "Partial index trick" at the
bottom and the use of temp tables. Most of these tricks came from either
this news group, or from my colleagues in the bioinformatics department, so
I'd like to thank and acknowledge both groups.

I'd like to thank Tom Lane, who clearly has been working very hard on the
Optimizer, and all the other people who have worked on Postgres. Your
efforts have been invaluable to us. Keep up the good work!

We are currently working on a Transcriptome project, which is a follow-on
to the human genome project, in which we systematically look across all
parts of the genome to see what is expressed in the form of RNA. It is
publicly funded by the National Cancer Institute and the data is made
publicly available at: http://www.netaffx.com/transcriptome/
<http://www.netaffx.com/transcriptome/>

We currently have about 100GB of data and will soon grow to a
multi-terabyte system. We have tables of up to 1 billion rows and have been
able to get ~1 million row queries to run in about 5 min. We've been very
pleased with postgres. After a couple of major outages in our server room,
it came back up flawlessly each time. So it has been an invaluable asset
for this project. We run 7.2 on Red Hat on a 2-processor machine with SAN,
and we have a 128-node linux cluster which will make analysis runs against
the database.

Our main request is continued enhancement of the optimizer for these heavy
types of queries. Improved use of indexes, ability to control execution
plans explicitly, ability to use indexes for data retrieval without touching
the table in certain cases, and other such features would be very useful.
I'm also curious to hear about whether there is any good clustering system
for making a parallel postgres installation, and if others have experience
with creating such large databases.

We've been very happy and impressed with the constant improvements to the
system. Thank You!

This page is a long detailed list of performance tips for doing heavy duty
queries.

* Indexes 1. Indexes are critical. Create exactly the combined
(multi-field) indexes that are being joined in a particular join. The order
of fields in the index and in the join must match exactly.
* Indexes 2. Multi-Field Indexes. Having indexes on individual columns
as well as combinations of 2,3,and 4 columns can help. Sometimes is uses the
3 version, and sometimes it uses one 2 and one singlet index. This can be
helpful, especially when seq scan is turned off and you are using limit.
* Indexes 3. Remember that multiple-field indexes must have the fields
in the correct order as they are accessed in the query. An index can only be
used to the extent allowed by the keys. An index over (A B C) can be used to
find (A B), but not (B C).
* Vacuum. Always vacuum analyze the table(s) after creating indices
(or loading/deleting data).
* Limit and Order by. May have to use order by and/or limit to use the
indexes. May need to use order by with limit. Sometimes order by increases
speed by causing use of an index. Sometimes it decreases speed because a
Sort step is required. A where condition that is sufficiently restrictive
may also cause an index to be used.
* Join Order. Order of fields, joins, and order by fields has a big
impact.
* Casting 1. May have to explicitly cast things. For instance where
x=3 must become (where x=cast(3 as smallint)). This can make a huge
difference.
* Casting 2. Simply adding abs(destype)=(cast 111 as smallint) to my
query and turning seq scans off seems to change the query execution plan.
Writing this as (destype=111 or destype=-111) makes the cost over 7 times
higher!!
* Seq Scans 1. Can you disable seq scans? Yes, you can type "set
enable_seqscan=no;" at the psql prompt and disable it. Do not be surprised
if this does not work though. You can also disable merges, joins, nested
loops, and sorts. Try this and attempt to enable the correct combination
that you want it to use.
* Seq Scans 2. In general you would like it to use an index, but don't
be afraid to try the seq scans if cost is say < 150,000 and see if it it
finishes in a few minutes. For large joins with no where clause, Postgres
always uses seq scans. Try to add a where clause, even a non-restrictive
one, and use an index. However, remember that postgres must go get the table
data too, so this can be more costly. Postgres cannot read data solely from
an index (some commercial databases can).
* Seq Scans 3. Sometimes it is true that seq scans are faster. It
tries to use the analyzed statistics to decide which is better. But don't
always trust it, try it both ways. This is why analyzing your table will
produce different execution plans at after analysis -- The analysis step
will update the stats of the table. The change in estimated costs might
cause a different plan to be chosen.
* Explain Output. Reading the Explain output can be confusing. In
general, the numbers are a range. If you are trying to just get some rows
back, you'd like the left most number to be 0. This means that the
right-most number will probably not happen, because you will not really have
to search the entire table. The right-most number is an upper bound. The
numbers sum as you go up. What you don't want is a large number for both the
min and max. Sometimes a cost of about 100,000 takes about 3 minutes.
Sometimes this is not accurate. Sometimes I was able to to see a lower seq
scan cost, but when I disable seq scans and used indexes, the actual
performance was faster. In general the cost is in milliseconds. Use
Explain Analyze which will run through they query and produce actual times.
* SQL tricks. Remember the standard SQL tricks which I will not cover
here (get a good thick SQL book). For example using Like, etc. can be slow.
Remember that if there is no data in your table for a given where clause, it
must scan the entire result just to tell you "no results found" so know your
data in advance.
* Nested loops are probably the most expensive operation.
* Having several merges and sorts can be way better than having a
single nestloop in your query.
* Explicit Joins. For more than 2 joined tables, consider using
explicit joins (see:
http://www.ca.postgresql.org/users-lounge/docs/7.1/postgres/explicit-joins.h
tml)
<http://www.ca.postgresql.org/users-lounge/docs/7.1/postgres/explicit-joins.
html)>
* News Groups. Try the postgres news groups:
http://www.us.postgresql.org/users-lounge/index.html
<http://www.us.postgresql.org/users-lounge/index.html>
* Hardware/Configuration changes. I won't go into a lot of detail here
as this page is more about the query optimizer, but you can look at how much
your CPU and memory is being taxed, and try running postmaster with various
flags to increase speed and memory. However, if your query plan is not
coming out right this will have little impact.
* Identities. You can try typing "and a.id=a.id" and this will
actually help encourage the query planner to use an index. In one example,
select with x=x and y=y order by x worked best (order by y too made it
worse!).
* Temp tables. You may want to explicitly control the query by
breaking it into several steps, with intermediate tables being created along
the way. You can make these true temp tables, which will go away when you
log out, or you may want to keep them around. You might want to create a
procedure or script that automates/hides this process.
* Views. Views sometimes say that they are adding a step to the query
planner, but it does not seem to impact query speed. But if you add more
clauses to the view this may change the query plan in a bad way, which is
confusing to the user.
* Stored Procedures. Try writing a stored procedure to more explicitly
control the query execution. If you do this break out SQL into many small
cursors instead of 1 large cursor, otherwise you will run up against the
same problems.
* External programs. As above, breaking out a query into a series of
small, explicit nested loops in a C, Perl, or other client program, may
actually improve performance (especially if you want a subset of
results/tables).
* Monitor Query Progress. Alan Williams provided a good trick to
monitor the progress of a long running query. If you add to the query a
sequence (select nextval('sq_test'),...) then you can use select
currval('sq_test') to see how far the query has progressed.
* Partial Indices. You can use this feature to force use of an
index!!! (it is also useful as a true partial index). Assume table1 below
has no rows where field1=0. By doing the actions below, it stores the clause
field1<>0 in pg_index and when it sees that predicate, it always uses the
partial index. In this case we are using it as a full index to trick it.
Example:

create index i on table1(field1) where field1 <> 0;

select * from table1 where field1<>0;

Shane Brubaker

BioInformatics Engineer

Affymetric, Inc.

Browse pgsql-sql by date

  From Date Subject
Next Message ARP 2002-04-25 18:34:09 Re: copy command
Previous Message Ian Morgan 2002-04-25 17:04:22 Re: How to discover foreign keys (without pulling hair out)