Re: tuning Postgres for large data import (using Copy from)

From: John A Meinel <john(at)arbash-meinel(dot)com>
To: Marc Mamin <m(dot)mamin(at)gmx(dot)net>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: tuning Postgres for large data import (using Copy from)
Date: 2005-05-12 14:53:31
Message-ID: 42836DEB.30600@arbash-meinel.com
Lists: pgsql-performance

Marc Mamin wrote:
> Hello,
>

I'm not an expert, but I'll give some suggestions.

>
> I'd like to tune Postgres for large data import (using Copy from).
>

I believe that COPY FROM <file> is supposed to be faster than COPY FROM
STDIN, but <file> must be available to the backend process. If you can
arrange that, it is worth doing, since it eliminates the communication
between the client and the backend.
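
For example (the table name and path here are made up; the file has to
sit on the database server and be readable by the backend):

-- server-side COPY: the backend reads the file directly,
-- so nothing is shipped over the client connection
COPY my_import_table FROM '/var/lib/pgsql/import/data.txt';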

>
> here are a few steps already done:
>
>
>
> 1) use 3 different disks for:
>
> -1: source data
> -2: index tablespaces
> -3: data tablespaces
>

Make sure pg_xlog is on its own filesystem. It contains the
write-ahead log, and putting it by itself keeps the number of seeks
down. If you are constrained, I think moving pg_xlog is more important
than moving the index tablespaces.
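
If you are on 8.0, tablespaces make that disk split explicit (the paths
and names below are just placeholders):

-- one tablespace per spindle
CREATE TABLESPACE data_space LOCATION '/mnt/disk3/pg_data';
CREATE TABLESPACE index_space LOCATION '/mnt/disk2/pg_index';
-- then place objects on them explicitly
CREATE TABLE my_import_table (id integer, payload text) TABLESPACE data_space;
CREATE INDEX my_import_table_id_idx ON my_import_table (id) TABLESPACE index_space;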

>
> 2) define all foreign keys as initially deferred
>
>
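
For reference, a deferred foreign key looks something like this (the
names are made up); it postpones the check until commit instead of
firing it per row:

ALTER TABLE child_table
ADD CONSTRAINT child_parent_fk
FOREIGN KEY (parent_id) REFERENCES parent_table (id)
DEFERRABLE INITIALLY DEFERRED;
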
> 3) tune some parameters:
>
>
>
> max_connections = 20
> shared_buffers = 30000
> work_mem = 8192
> maintenance_work_mem = 32768
> checkpoint_segments = 12
>
> (I also modified the kernel accordingly)
>

Don't forget to increase your free space map settings if you are going
to be doing frequent deletes.
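
In 8.0 that means raising the fsm settings in postgresql.conf; the
numbers below are only a guess, size them to how many pages your daily
deletes touch:

max_fsm_pages = 500000      # default is 20000; pages with free space to track
max_fsm_relations = 2000    # tables and indexes to track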

>
>
>
> 4) run VACUUM regularly
>
>
> The server runs RedHat and has 1GB RAM
>
> In the production (which may run on a better server), I plan to:
>
> - import a few million rows per day,
> - keep up to ca. 100 million rows in the db
> - delete older data
>
>
>
>
> I've seen a few postings on hash/btree indexes, which say that hash indexes
> do not work very well in Postgres;
> currently, I only use btree indexes. Could I gain performance by using
> hash indexes as well ?
>
I doubt it. Hash indexes in Postgres have historically been no faster
than btrees for lookups, and they have other drawbacks (they aren't
WAL-logged, for one).

> How does Postgres handle concurrent COPY FROM on: the same table / different
> tables ?
>

I think it is better with different tables. If both load into the same
table and there are indexes, each writer has to grab a lock to update
the index, which causes contention between the two processes writing to
the same table.

>
> I'd be glad of any further suggestions on how to increase my performance.
>

Since you are deleting data often, and copying often, I might recommend
using a partition scheme with a view to bind everything together. That
way you can just drop the old table rather than doing a delete. I don't
know how this would affect foreign key references.

But basically you can create a new table, do the copy without any
indexes in place, then build the indexes, analyze, and update the view.

And when deleting you can update the view, and drop the old table.

Something like this (note the view can't literally be named "table",
since that's a reserved word, so I've used a made-up name):

CREATE TABLE table_2005_05_11 (LIKE table_2005_05_10); -- same columns as before
COPY table_2005_05_11 FROM ... ;
CREATE INDEX blah ON table_2005_05_11 (blah);
VACUUM ANALYZE table_2005_05_11;
CREATE OR REPLACE VIEW my_table AS
SELECT * FROM table_2005_05_10
UNION ALL SELECT * FROM table_2005_05_11;
...
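
And expiring the oldest day is just the reverse (again only a sketch,
using the same made-up view name):

CREATE OR REPLACE VIEW my_table AS
SELECT * FROM table_2005_05_11;
DROP TABLE table_2005_05_10;

Replacing the view first removes the reference to the old table, so the
DROP succeeds without touching the newer data.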

John
=:->

>
> Marc
>