Re: Large Database Design Help

From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Orion Henry <lab(at)orangekids(dot)org>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Large Database Design Help
Date: 2006-02-09 19:45:07
Message-ID: 20060209194507.GW57845@pervasive.com
Lists: pgsql-performance

On Thu, Feb 09, 2006 at 11:07:06AM -0800, Orion Henry wrote:
>
> Hello All,
>
> I've inherited a postgresql database that I would like to refactor. It
> was originally designed for Postgres 7.0 on a PIII 500MHz and some
> design decisions were made that don't make sense any more. Here's the
> problem:
>
> 1) The database is very large, the largest table has 40 million tuples.
>
> 2) The database needs to import 10's of thousands of tuples each night
> quickly. The current method is VERY slow.
>
> 3) I can't import new records with a COPY or drop my indexes b/c some of
> them are new records (INSERTS) and some are altered records (UPDATES)
> and the only way I can think of to identify these records is to perform
> a select for each record.
>
> Here is how the database is currently laid out and you'll see why I have
> a problem with it
>
> 1) The data is easily partitionable by client ID. In an attempt to keep
> the indexes small and the inserts fast one table was made per client
> ID. Thus the primary table in the database (the one with 40 million
> tuples) is really 133 tables each ending with a three digit suffix.
> The largest of these client tables has 8 million of the 40 million
> tuples. The system started with around a half dozen clients and is now
> a huge pain to manage with so many tables. I was hoping new hardware
> and new postgres features would allow for this data to be merged safely
> into a single table.

If all the clients are equally active then partitioning by client
probably makes little sense. If some clients are much more active than
others then leaving this as-is could be a pretty big win. If the
partitioning is done with either a view and rules or inherited tables
and rules, it shouldn't be too hard to manage.
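
For example, something along these lines with inherited tables (untested,
names made up, column list trimmed; you'd repeat the child table and rule
for each client):

    -- Parent table holds the common structure.
    CREATE TABLE actioninfo (
        client_id        integer NOT NULL,
        order_number     integer NOT NULL,
        transaction_date date
        -- ... the rest of the columns ...
    );

    -- One child per client; the CHECK constraint lets 8.1's
    -- constraint_exclusion setting skip the other children at plan time.
    CREATE TABLE actioninfo_001 (
        CHECK ( client_id = 1 )
    ) INHERITS ( actioninfo );

    -- Redirect inserts against the parent into the right child.
    CREATE RULE actioninfo_insert_001 AS
        ON INSERT TO actioninfo WHERE ( client_id = 1 )
        DO INSTEAD
        INSERT INTO actioninfo_001
        VALUES ( NEW.client_id, NEW.order_number, NEW.transaction_date );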

> 2) The imports are not done inside of transactions. I'm assuming the
> system designers excluded this for a reason. Will I run into problems
> performing tens of thousands of inserts and updates inside a single
> transaction?

Never attribute to thoughtful design that which can be fully explained
by ignorance. :) I'd bet they just didn't know any better.

> 3) The current code that bulk loads data into the database is a loop
> that looks like this:
>
> $result = exe("INSERT INTO $table ($name_str) SELECT $val_str
>                WHERE NOT EXISTS (SELECT 1 FROM $table WHERE $keys)");
> if ($result == 0)
> {
>     $result = exe("UPDATE $table SET $non_keys WHERE $keys");
> }
>
> Is there a faster way to bulk load data when it's not known ahead of
> time if it's a new record or an updated record?

Uuuugly. :) Instead, load everything into a temp table using COPY, then
UPDATE real_table ... FROM temp_table t WHERE real_table.key = t.key,
followed by INSERT ... SELECT ... WHERE NOT EXISTS. Note that this has a
race condition, so you can only do it if you know nothing else will be
inserting into the real table at the same time.
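
Something along these lines -- untested, column lists trimmed, and
assuming your three not-null columns form the key (adjust to suit):

    BEGIN;

    -- Stage the nightly file in a temp table; it vanishes at commit.
    CREATE TEMP TABLE load_tmp (
        order_number      integer,
        order_line_number integer,
        action_number     integer,
        transaction_date  date
        -- ... the rest of the columns ...
    ) ON COMMIT DROP;

    -- Placeholder path; server-side COPY needs superuser,
    -- otherwise use \copy from psql.
    COPY load_tmp FROM '/path/to/nightly/file';

    -- First update the rows that already exist...
    UPDATE real_table
       SET transaction_date = t.transaction_date
           -- ... other non-key columns ...
      FROM load_tmp t
     WHERE real_table.order_number      = t.order_number
       AND real_table.order_line_number = t.order_line_number
       AND real_table.action_number     = t.action_number;

    -- ...then insert the ones that don't.  This is the racy part:
    -- nothing else may be inserting into real_table concurrently.
    INSERT INTO real_table
    SELECT t.*
      FROM load_tmp t
     WHERE NOT EXISTS (
            SELECT 1 FROM real_table r
             WHERE r.order_number      = t.order_number
               AND r.order_line_number = t.order_line_number
               AND r.action_number     = t.action_number );

    COMMIT;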

You might want to look at the stats-proc code at
http://cvs.distributed.net; it does exactly this type of thing.

> What I would LIKE to do but am afraid I will hit a serious performance
> wall (or am missing an obvious / better way to do it)
>
> 1) Merge all 133 client tables into a single new table, add a client_id
> column, do the data partitioning on the indexes not the tables as seen here:
>
> CREATE INDEX actioninfo_order_number_XXX_idx ON actioninfo (
> order_number ) WHERE client_id = XXX;
> CREATE INDEX actioninfo_trans_date_XXX_idx ON actioninfo (
> transaction_date ) WHERE client_id = XXX;
>
> (Aside question: if I were to find a way to use COPY and I were
> loading data on a single client_id, would dropping just the indexes for
> that client_id accelerate the load?)

Hrm, I believe it would...
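
If so, the per-client load could drop and rebuild just that client's
partial indexes around the COPY, something like this (client number made
up, following your index definitions above):

    DROP INDEX actioninfo_order_number_042_idx;
    DROP INDEX actioninfo_trans_date_042_idx;

    -- ... COPY / merge client 42's rows here ...

    CREATE INDEX actioninfo_order_number_042_idx
        ON actioninfo ( order_number ) WHERE client_id = 42;
    CREATE INDEX actioninfo_trans_date_042_idx
        ON actioninfo ( transaction_date ) WHERE client_id = 42;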

> 2) Find some way to make the bulk loads faster or more efficient (help!)

Don't do things row-by-row. If you can't ensure that there will be only
one process inserting (to eliminate the race condition I mentioned above),
then reply back and I'll point you at code that should still be much
faster than what you're doing now.

> 3) Wrap each load into a transaction ( tens of thousands of records per
> load )

Getting rid of row-by-row will be your biggest win. If you do have to do
row-by-row, at least wrap it in a transaction. As long as the
transaction doesn't take *too* long it won't be an issue.
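
In other words, just bracket the load, roughly:

    BEGIN;
    -- ... the existing per-row INSERTs/UPDATEs for this load ...
    COMMIT;

If a single load runs very long, you can also commit every few thousand
rows instead of once at the very end.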

> Is this a good plan? Is there a better way? Am I walking into a trap?
> Should I leave well enough alone and not try and fix something that's
> not broken?
>
> FWIW here's the hardware and the profile of the current uber table:
>
> Column | Type | Modifiers
> -------------------+---------+-----------
> order_number | integer | not null
> order_line_number | integer | not null
> action_number | integer | not null
> transaction_date | date |
> code | text |
> trans_group_code | text |
> quantity | integer |
> extension | money |
> sales_tax | money |
> shipping | money |
> discount | money |
>
> Dual Opteron 246, 4 disk SCSI RAID5, 4GB of RAM

Remember that the write performance of RAID 5 normally stinks.

> # du -sh /var/lib/postgres/data/
> 16G /var/lib/postgres/data/
>
> ( the current database is PG 7.4 - I intend to upgrade it to 8.1 if and
> when I do this refactoring )

Going to 8.1 would help in a large number of ways even if you don't
refactor. The stats-proc code I mentioned runs 2x faster under 8.1 than
it does under 7.4.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
