Re: High update activity, PostgreSQL vs BigDBMS

From: "Craig A(dot) James" <cjames(at)modgraph-usa(dot)com>
To: Guy Rouillier <guyr-ml1(at)burntmail(dot)com>
Cc: PostgreSQL Performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: High update activity, PostgreSQL vs BigDBMS
Date: 2007-01-07 07:23:32
Message-ID: 45A09FF4.7060906@modgraph-usa.com
Lists: pgsql-performance

Guy,

> The application is fairly straightforward, but as you say, what is
> working okay with BigDBMS isn't working as well under PG. I'm going to
> try other configuration suggestions made by others before I attempt
> logic changes. The core logic is unchangeable; millions of rows of data
> in a single table will be updated throughout the day. If PG can't
> handle high volume updates well, this may be a brick wall.

Here are a couple things I learned.

ANALYZE is VERY important, surprisingly so even for small tables. I had a case last week where a temporary "scratch" table with just 100 rows was joined to two more tables of 6 and 12 million rows. You might think that a 100-row table wouldn't need to be analyzed, but it does: without the ANALYZE, Postgres generated a horrible plan that took many minutes to run; with the ANALYZE, it took milliseconds. Any time a table's contents change dramatically, ANALYZE it, ESPECIALLY if it's a small table. After all, changing 20 rows in a 100-row table has a much larger effect on its statistics than changing 20 rows in a million-row table.
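
As a rough sketch of the pattern (table and column names here are invented for illustration, not from my app):

    -- Hypothetical names, for illustration only.
    CREATE TEMP TABLE scratch_ids (id integer);
    INSERT INTO scratch_ids
        SELECT id FROM big_table_a WHERE flagged LIMIT 100;

    ANALYZE scratch_ids;   -- without this, the planner guesses badly about the join

    SELECT count(*)
      FROM scratch_ids s
      JOIN big_table_a a ON a.id = s.id
      JOIN big_table_b b ON b.a_id = a.id;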

Postgres functions like count() and max() are "plug ins", a design that has huge architectural advantages. But in pre-8.1 releases there was a big speed penalty for this: aggregates like max() could not use an index and were very, very slow, requiring a full table scan. I think this is vastly improved from 8.0.x to 8.1 and forward; others might be able to comment on whether count() is now as fast in Postgres as in Oracle. The "idiom" to replace max() was "select col from tbl order by col desc limit 1". It worked miracles for my app.
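
For example (table and column names hypothetical), assuming an index exists on seq_id:

    -- Slow on pre-8.1: max() ignores the index and scans the whole table
    SELECT max(seq_id) FROM events;

    -- The workaround: walk the index backward and stop at the first row
    SELECT seq_id FROM events ORDER BY seq_id DESC LIMIT 1;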

Postgres has explicit garbage collection via VACUUM, and you have to design your application with this in mind. In Postgres, update is delete+insert, meaning updates create garbage. If you have very "wide" tables, but only a subset of the columns are updated frequently, put these columns in a separate table with an index to join the two tables. For example, my original design was something like this:

integer primary key
very large text column
... a bunch of integer columns, float columns, and small text columns

The properties were updated by the application, but the large text column never changed. This led to huge garbage-collection problems as the large text field was repeatedly deleted and reinserted by the updates. By separating these into two tables, one with the large text column and the other with the dynamic but smaller columns, garbage is massively reduced and performance improves, both immediately (a smaller data set to update) and over the long term (smaller vacuums). You can use views to recreate the original combined table, so the changes to your app are limited to the places where updates occur.
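
In rough SQL terms (the names are invented for the sketch, not my actual schema), the split looks something like this:

    -- Static part: the big text column that never changes
    CREATE TABLE molecule_static (
        id        integer PRIMARY KEY,
        big_text  text
    );

    -- Dynamic part: the small, frequently updated columns
    CREATE TABLE molecule_props (
        id        integer PRIMARY KEY REFERENCES molecule_static(id),
        score     float,
        flags     integer,
        label     text
    );

    -- A view recreates the original combined shape for read-only code
    CREATE VIEW molecule AS
        SELECT s.id, s.big_text, p.score, p.flags, p.label
          FROM molecule_static s
          JOIN molecule_props  p ON p.id = s.id;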

If you have a column that is *frequently* updated (say, for example, a user's last-access timestamp each time s/he hits your web server) then you definitely want this in its own table, not mixed in with the user's name, address, etc.
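
A tiny sketch of that idea (hypothetical names again), assuming a users table with an integer id:

    -- Hot column isolated in its own narrow table
    CREATE TABLE user_last_access (
        user_id      integer PRIMARY KEY REFERENCES users(id),
        last_access  timestamptz NOT NULL
    );

    UPDATE user_last_access SET last_access = now() WHERE user_id = 42;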

Partitioning in Postgres is more powerful than in Oracle. Use it if you can.
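
In the 8.x line, partitioning is done with table inheritance plus CHECK constraints and constraint_exclusion; a minimal sketch (names and date ranges invented):

    CREATE TABLE measurements (logdate date NOT NULL, value float);

    CREATE TABLE measurements_2006 (
        CHECK (logdate >= '2006-01-01' AND logdate < '2007-01-01')
    ) INHERITS (measurements);

    CREATE TABLE measurements_2007 (
        CHECK (logdate >= '2007-01-01' AND logdate < '2008-01-01')
    ) INHERITS (measurements);

    SET constraint_exclusion = on;  -- lets the planner skip non-matching children

    SELECT avg(value) FROM measurements WHERE logdate >= '2007-01-01';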

Partial indexes are VERY nice in Postgres, if your data is poorly distributed (for example, a mostly-NULL column with a small percentage of very important values).
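
For instance (hypothetical column), if important_code is NULL for the vast majority of rows:

    -- Index only the rows that matter; the mostly-NULL bulk stays out of it
    CREATE INDEX idx_tbl_important ON tbl (important_code)
        WHERE important_code IS NOT NULL;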

I'm sure there are more things that others can contribute.

Craig
