Re: tuning Postgres for large data import (using Copy from)

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: "Marc Mamin" <m(dot)mamin(at)gmx(dot)net>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: tuning Postgres for large data import (using Copy from)
Date: 2005-05-12 17:25:41
Message-ID: 200505121025.41556.josh@agliodbs.com
Lists: pgsql-performance

Marc,

> 1) use 3 different disks for:
>
> -1: source data
> -2: index tablespaces
> -3: data tablespaces

Others have already told you about the importance of relocating WAL. If you
are going to be building indexes on the imported data, you might find it
beneficial to relocate pgsql_tmp for the database in question as well.
Also, I generally find it more beneficial to separate the few largest tables
onto their own disk resources than to put all tables on one resource and all
indexes on another. For example, for TPCH-like tests, I do:
array0: OS and pgsql_tmp
array1: LINEITEM
array2: LINEITEM Indexes
array3: all other tables and indexes
array4: pg_xlog
array5: source data

This allows me to load a 100G (actually 270G) TPCH-like database in < 2 hours,
not counting index-building.
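
The layout above maps onto tablespaces. A minimal sketch, assuming the arrays
are mounted at paths like /mnt/array1 (mount points, tablespace names, and
columns are illustrative, not from the thread):

```sql
-- Illustrative only: paths and names are assumptions.
CREATE TABLESPACE ts_lineitem     LOCATION '/mnt/array1/pgdata';
CREATE TABLESPACE ts_lineitem_idx LOCATION '/mnt/array2/pgdata';

-- Put the big table and its indexes on separate arrays.
CREATE TABLE lineitem (
    l_orderkey    bigint,
    l_linenumber  integer,
    l_quantity    numeric
) TABLESPACE ts_lineitem;

CREATE INDEX lineitem_pk ON lineitem (l_orderkey, l_linenumber)
    TABLESPACE ts_lineitem_idx;
```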

> 2) define all foreign keys as initially deferred

It would be better to drop them before import and recreate them afterwards.
Same for indexes unless those indexes are over 2G in size.
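
A hedged sketch of that drop/load/recreate pattern (table, constraint, and
file names are made up for illustration):

```sql
-- Drop the constraint and index before the bulk load (names are examples).
ALTER TABLE orders DROP CONSTRAINT orders_custkey_fkey;
DROP INDEX orders_custkey_idx;

-- Bulk load with COPY.
COPY orders FROM '/mnt/source/orders.tbl' WITH DELIMITER '|';

-- Recreate afterwards: building once over all rows is far cheaper than
-- maintaining the index and checking the FK row by row during the load.
CREATE INDEX orders_custkey_idx ON orders (o_custkey);
ALTER TABLE orders ADD CONSTRAINT orders_custkey_fkey
    FOREIGN KEY (o_custkey) REFERENCES customer (c_custkey);
```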

> max_connections =20
> shared_buffers =30000
> work_mem = 8192

Not high enough, unless you have very little RAM. On an 8G machine I'm using
256MB. You might want to use 64MB or 128MB.

> maintenance_work_mem = 32768

REALLY not high enough. You're going to need to build big indexes and
possibly vacuum large tables. I use the maximum of 1.98GB. Use up to 1/3 of
your RAM for this.

> checkpoint_segments = 12

Also way too low. Put pg_xlog on its own disk, and give it 128 to 512
checkpoint segments (up to 8GB of WAL).
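
Pulled together, a postgresql.conf sketch for the load window on the 1GB
machine in question (my numbers, not Josh's; 8.0-era units, so
shared_buffers is in 8KB pages and the *_mem settings are in KB):

```
# postgresql.conf sketch for the bulk-load window (values are assumptions)
shared_buffers       = 30000    # ~234MB at 8KB pages
work_mem             = 65536    # 64MB per sort/hash, as suggested above
maintenance_work_mem = 262144   # 256MB, toward the 1/3-of-RAM guideline
checkpoint_segments  = 128      # 128 x 16MB segments = 2GB of WAL
```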

> The server runs RedHat and has 1GB RAM

Make sure you're running a 2.6.10+ kernel. Make sure ext3 is mounted with
noatime, data=writeback. Buy more RAM. Etc.
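
For the mount options, a hedged /etc/fstab sketch (device and mount point are
assumptions):

```
# /etc/fstab line for the data array (device and mount point are examples)
/dev/sdb1  /mnt/pgdata  ext3  noatime,data=writeback  0 2
```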

> How does Postgres handle concurrent copy from on: same table / different
> tables ?

Same table is useless; the imports will effectively serialize (unless you use
pseudo-partitioning). You can parallel load on multiple tables up to the
lower of your number of disk channels or number of processors.
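
A sketch of a per-table parallel load driver in shell (the database name,
table list, and file paths are all assumptions, not from the thread):

```shell
#!/bin/sh
# Build the server-side COPY statement for one table (pure helper).
copy_cmd() {
    echo "COPY $1 FROM '/mnt/source/$1.tbl' WITH DELIMITER '|'"
}

# One backgrounded psql per table; cap the table count at the lower of
# your disk channels or CPUs, per the advice above. Guarded so the
# sketch is a no-op where psql isn't installed.
if command -v psql >/dev/null 2>&1; then
    for t in part supplier customer orders; do
        psql -d tpch -c "$(copy_cmd "$t")" &
    done
    wait    # block until every load finishes
fi
```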

--
Josh Berkus
Aglio Database Solutions
San Francisco
