I do mass inserts daily into PG.  I drop all the indexes except my primary key and then use the COPY FROM command.  This usually takes less than 30 seconds.  I spend more time waiting for the indexes to recreate than for the load itself.
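
In rough terms it looks like this (table and index names below are just
placeholders, not the real schema):

    -- drop everything except the primary key index
    DROP INDEX daily_sales_custid_idx;
    DROP INDEX daily_sales_sku_idx;

    -- server-side bulk load from a flat file (or \copy from psql)
    COPY daily_sales FROM '/path/to/load_file';

    -- rebuild the secondary indexes afterwards
    CREATE INDEX daily_sales_custid_idx ON daily_sales (custid);
    CREATE INDEX daily_sales_sku_idx    ON daily_sales (sku);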


Patrick Hatcher
Macys.Com
 
-----pgsql-performance-owner@postgresql.org wrote: -----

To: pgsql-performance@postgresql.org
From: Christopher Browne <cbbrowne@acm.org>
Sent by: pgsql-performance-owner@postgresql.org
Date: 2004-12-04 06:48AM
Subject: Re: [PERFORM] Improve BULK insertion

In the last exciting episode, grupos@carvalhaes.net (Grupos) wrote:
> Hi !
>
> I need to insert 500,000 records into a table frequently. It's a bulk
> insertion from my application.
> I am getting very poor performance. PostgreSQL inserts very fast until
> about tuple 200,000, and after that the insertion starts to be really slow.
> I am seeing on the log and there is a lot of transaction logs,
> something like :
>
> 2004-12-04 11:08:59 LOG:  recycled transaction log file "0000000600000012"
> 2004-12-04 11:08:59 LOG:  recycled transaction log file "0000000600000013"
> 2004-12-04 11:08:59 LOG:  recycled transaction log file "0000000600000011"
> 2004-12-04 11:14:04 LOG:  recycled transaction log file "0000000600000015"
> 2004-12-04 11:14:04 LOG:  recycled transaction log file "0000000600000014"
> 2004-12-04 11:19:08 LOG:  recycled transaction log file "0000000600000016"
> 2004-12-04 11:19:08 LOG:  recycled transaction log file "0000000600000017"
> 2004-12-04 11:24:10 LOG:  recycled transaction log file "0000000600000018"

It is entirely normal to see a lot of transaction log file recycling
while bulk inserts are taking place; loading that much data simply
generates a lot of transaction log traffic.
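
(As an aside: each of those "recycled" messages comes from a checkpoint
reusing old segments, and how often that happens is governed by the
checkpoint settings in postgresql.conf.  If you did want fewer
checkpoints during the load, these are the knobs; the numbers below are
purely illustrative, not a recommendation for your box.)

    # postgresql.conf -- illustrative values only
    checkpoint_segments = 16    # default 3; more segments = fewer forced checkpoints
    checkpoint_timeout  = 300   # seconds between timed checkpoints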

> How can I configure PostgreSQL to have a better performance on this
> bulk insertions ? I already increased the memory values.

Memory is, as likely as not, NOT the issue.

Two questions:

1.  How are you doing the inserts?  Via INSERT statements?  Or
    via COPY statements?  What sort of transaction grouping
    is involved?

    COPY is way faster than INSERT, and grouping plenty of updates
    into a single transaction is generally a "win."  (There's a rough
    sketch of what I mean below the second question.)

2.  What is the schema like?  Does the table have a foreign key
    constraint?  Does it have a bunch of indices?

    If there should eventually be lots of indices, it tends to be
    faster to create the table with none/minimal indices, and add
    indexes afterwards, as long as your "load" process can be trusted
    to not break "unique" constraints...

    If there is some secondary table involved in a foreign key
    constraint, and _that_ table has been growing, it is possible that
    a sequential scan is being used to search it during the constraint
    checks; if you ran ANALYZE on that table, an index scan would be
    preferred once it had grown large enough...
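
To make that concrete (names here are invented, since I don't know your
schema; "bulk_data" is the table being loaded and "customers" stands in
for the table referenced by a foreign key, if any):

    -- preferred: one COPY, one transaction
    BEGIN;
    COPY bulk_data FROM '/path/to/data_file';
    COMMIT;

    -- if you must use INSERT, at least batch thousands per transaction
    BEGIN;
    INSERT INTO bulk_data VALUES (1, 'first row');
    INSERT INTO bulk_data VALUES (2, 'second row');
    -- ...and so on...
    COMMIT;

    -- refresh statistics on the referenced table so the planner can
    -- choose an index scan for the constraint checks
    ANALYZE customers;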

There isn't any particular reason for PostgreSQL to "hit a wall" upon
seeing 200K records; my coworkers and I routinely load database dumps
that have millions of (sometimes pretty fat) records, and they don't
"choke."  That's true whether we're loading onto my (somewhat wimpy)
desktop PC, an SMP Xeon system with a small RAID array, or higher-end
stuff involving high-end SMP and EMC disk arrays.  The latter are,
obviously, orders of magnitude faster than desktop equipment :-).
--
(format nil "~S@~S" "cbbrowne" "acm.org")
http://www3.sympatico.ca/cbbrowne/unix.html
Rules of the  Evil Overlord #207. "Employees will  have conjugal visit
trailers which  they may use provided  they call in  a replacement and
sign out on  the timesheet. Given this, anyone caught  making out in a
closet  while  leaving  their   station  unmonitored  will  be  shot."
<http://www.eviloverlord.com/>
