Re: Improve COPY performance for large data sets

From: "Scott Carey" <scott(at)richrelevance(dot)com>
To: "Alan Hodgson" <ahodgson(at)simkin(dot)ca>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Improve COPY performance for large data sets
Date: 2008-09-10 17:49:22
Message-ID: a1ec7d000809101049k19b56deh9bc212eaede88d79@mail.gmail.com
Lists: pgsql-performance

A single SATA drive may not be the best performer, but:

1. It won't make a load take 12 hours unless the load is, in total, similar
in size to the disk. Even a slow, newer SATA drive will read and write at
~50MB/sec at minimum, so the whole 320GB disk can be scanned at 3GB per
minute, under two hours end to end. It is not likely that 20M records is over
20GB, and at that size there is no way the disk is the bottleneck.
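
Spelling out the arithmetic behind that estimate:

    50 MB/sec * 60 = 3 GB/min
    320 GB / (3 GB/min) ~= 107 minutes, call it two hours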

2. To figure out whether the disk or the CPU is the bottleneck, don't assume:
check iostat or top and look at the disk utilization % and I/O wait times,
and check the backend process's CPU utilization. In my experience, many
things can cause COPY to be completely CPU bound even with slow disks
-- I have seen it bound to a 5MB/sec write rate on a 3GHz CPU, which a drive
from 1998 could handle.
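
For example (assuming the sysstat package is installed; device and column
names will vary by platform):

    $ iostat -x 5
        # watch %util and await for the device holding the data directory
    $ top
        # find the postgres backend doing the COPY; if it sits near 100% CPU
        # while the disk's %util stays low, the load is CPU bound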

It seems like this case is resolved, but there are some other good tuning
recommendations in this thread. Just don't blame the disk until it is
actually showing high utilization.

COPY is typically bound by either the disk or a single CPU. It is usually
CPU bound if there are indexes or constraints on the table, and sometimes
even when there are none.
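
As a sketch of the drop-and-recreate approach Alan suggests below (table,
column, and file names are made up):

    -- drop indexes first so COPY writes only the bare heap
    DROP INDEX IF EXISTS big_table_col_idx;

    -- give the index rebuild plenty of sort memory
    SET maintenance_work_mem = '1GB';

    COPY big_table FROM '/tmp/data.csv' WITH CSV;

    -- recreate the index afterwards in one sequential pass
    CREATE INDEX big_table_col_idx ON big_table (col);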

The pg_bulkload tool will, in almost all cases, be significantly faster, but
it has limitations that make it inappropriate for some uses.
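
Roughly, from memory of its documentation -- the control-file keys and the
-d option are worth double-checking against the project's README:

    # sample.ctl (illustrative)
    INPUT  = /tmp/data.csv
    OUTPUT = big_table
    TYPE   = CSV

    $ pg_bulkload -d mydb sample.ctl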

On Wed, Sep 10, 2008 at 10:14 AM, Alan Hodgson <ahodgson(at)simkin(dot)ca> wrote:

> On Wednesday 10 September 2008, Ryan Hansen <
> ryan(dot)hansen(at)brightbuilders(dot)com>
> wrote:
> > Currently it's taking about 12 hours to complete on a 64 bit
> > server with 3 GB memory allocated (shared_buffers), single SATA 320 GB
> > drive. I don't seem to get any improvement running the same operation
> > on a dual opteron dual-core, 16 GB server.
> >
> > I'm not asking for someone to solve my problem, just some direction in
> > the best ways to tune for faster bulk loading, since this will be a
> > fairly regular operation for our application (assuming it can work this
> > way). I've toyed with the maintenance_work_mem and some of the other
> > params, but it's still way slower than it seems like it should be.
> > So any contributions are much appreciated.
>
> Your drive subsystem, such as it is, is inappropriate for a database. Your
> bottleneck is your drive.
>
> Turning fsync off might help. You should also drop all indexes on the table
> before the COPY and add them back after (which would eliminate a lot of
> random I/O during the COPY).
>
> --
> Alan
