I do mass inserts daily into PG.  I drop all the indexes except my primary key and then use the COPY FROM command.  This usually takes less than 30 seconds.  I spend more time waiting for the indexes to recreate than for the load itself.
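
In rough terms it looks like this (table and index names below are just
placeholders, not the real schema):

    -- drop everything except the primary key index
    DROP INDEX daily_sales_custid_idx;
    DROP INDEX daily_sales_sku_idx;

    -- server-side bulk load from a flat file (or \copy from psql)
    COPY daily_sales FROM '/path/to/load_file';

    -- rebuild the secondary indexes afterwards
    CREATE INDEX daily_sales_custid_idx ON daily_sales (custid);
    CREATE INDEX daily_sales_sku_idx    ON daily_sales (sku);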


Patrick Hatcher
Macys.Com
 
-----pgsql-performance-owner@postgresql.org wrote: -----

To: pgsql-performance@postgresql.org
From: Christopher Browne <cbbrowne@acm.org>
Sent by: pgsql-performance-owner@postgresql.org
Date: 2004-12-04 06:48AM
Subject: Re: [PERFORM] Improve BULK insertion

In the last exciting episode, grupos@carvalhaes.net (Grupos) wrote:
> Hi !
>
> I need to insert 500,000 records into a table frequently. It's a bulk
> insertion from my application.
> I am getting very poor performance. PostgreSQL inserts very fast until
> about tuple 200,000, and after that the insertion starts to be really slow.
> I am seeing on the log and there is a lot of transaction logs,
> something like :
>
> 2004-12-04 11:08:59 LOG:  recycled transaction log file "0000000600000012"
> 2004-12-04 11:08:59 LOG:  recycled transaction log file "0000000600000013"
> 2004-12-04 11:08:59 LOG:  recycled transaction log file "0000000600000011"
> 2004-12-04 11:14:04 LOG:  recycled transaction log file "0000000600000015"
> 2004-12-04 11:14:04 LOG:  recycled transaction log file "0000000600000014"
> 2004-12-04 11:19:08 LOG:  recycled transaction log file "0000000600000016"
> 2004-12-04 11:19:08 LOG:  recycled transaction log file "0000000600000017"
> 2004-12-04 11:24:10 LOG:  recycled transaction log file "0000000600000018"

It is entirely normal to see a lot of transaction log file recycling
while bulk inserts are taking place; loading that much data simply
generates a lot of transaction log traffic.
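
(As an aside: each of those "recycled" messages comes from a checkpoint
reusing old segments, and how often that happens is governed by the
checkpoint settings in postgresql.conf.  If you did want fewer
checkpoints during the load, these are the knobs; the numbers below are
purely illustrative, not a recommendation for your box.)

    # postgresql.conf -- illustrative values only
    checkpoint_segments = 16    # default 3; more segments = fewer forced checkpoints
    checkpoint_timeout  = 300   # seconds between timed checkpoints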

> How can I configure PostgreSQL to have a better performance on this
> bulk insertions ? I already increased the memory values.

Memory is, as likely as not, NOT the issue.

Two questions:

1.  How are you doing the inserts?  Via INSERT statements?  Or
    via COPY statements?  What sort of transaction grouping
    is involved?

    COPY is way faster than INSERT, and grouping plenty of updates
    into a single transaction is generally a "win."  (There's a rough
    sketch of what I mean below the second question.)

2.  What is the schema like?  Does the table have a foreign key
    constraint?  Does it have a bunch of indices?

    If there should eventually be lots of indices, it tends to be
    faster to create the table with none/minimal indices, and add
    indexes afterwards, as long as your "load" process can be trusted
    to not break "unique" constraints...

    If there is some secondary table involved in a foreign key
    constraint, and _that_ table has been growing, it is possible that
    a sequential scan is being used to search it during the constraint
    checks; if you ran ANALYZE on that table, an index scan would be
    preferred once it had grown large enough...
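
To make that concrete (names here are invented, since I don't know your
schema; "bulk_data" is the table being loaded and "customers" stands in
for the table referenced by a foreign key, if any):

    -- preferred: one COPY, one transaction
    BEGIN;
    COPY bulk_data FROM '/path/to/data_file';
    COMMIT;

    -- if you must use INSERT, at least batch thousands per transaction
    BEGIN;
    INSERT INTO bulk_data VALUES (1, 'first row');
    INSERT INTO bulk_data VALUES (2, 'second row');
    -- ...and so on...
    COMMIT;

    -- refresh statistics on the referenced table so the planner can
    -- choose an index scan for the constraint checks
    ANALYZE customers;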

There isn't any particular reason for PostgreSQL to "hit a wall" upon
seeing 200K records; my coworkers and I routinely load database dumps
that have millions of (sometimes pretty fat) records, and they don't
"choke."  That's true whether we're loading onto my (somewhat wimpy)
desktop PC, an SMP Xeon system with a small RAID array, or higher-end
stuff involving high-end SMP and EMC disk arrays.  The latter are,
obviously, orders of magnitude faster than desktop equipment :-).
--
(format nil "~S@~S" "cbbrowne" "acm.org")
http://www3.sympatico.ca/cbbrowne/unix.html
Rules of the  Evil Overlord #207. "Employees will  have conjugal visit
trailers which  they may use provided  they call in  a replacement and
sign out on  the timesheet. Given this, anyone caught  making out in a
closet  while  leaving  their   station  unmonitored  will  be  shot."
<http://www.eviloverlord.com/>
