Re: Better way to bulk-load millions of CSV records into postgres?

From: Ron Johnson <ron(dot)l(dot)johnson(at)cox(dot)net>
To: PgSQL Novice ML <pgsql-novice(at)postgresql(dot)org>
Subject: Re: Better way to bulk-load millions of CSV records into postgres?
Date: 2002-05-22 17:48:58
Message-ID: 1022089739.19119.34.camel@rebel
Lists: pgsql-novice

On Wed, 2002-05-22 at 11:18, Marc Spitzer wrote:
> On Wed, May 22, 2002 at 09:19:31AM -0500, Tom Sheehan wrote:
> > Have you looked at the COPY command in psql for this? There are the FROM
> > { 'filename' | stdin } and [ [USING] DELIMITERS ] clauses that may be of use.
> > It seems to me that bulk loading is what they were designed for.
> >
> > ts
>
> for very large datasets that can be a problem: it is 1 transaction, and
> the transaction logs must be kept until it finishes or aborts. This
> can be a big disk hit.
>
> If it is just a plain csv file, you can use split to make 1 file into
> several smaller files and load each one separately.

That's a possibility, but that would create _lots_ of little
files... It's much more convenient to have just the 1 big file,
instead of the 1 big file plus many dozens of small files.

> you can look at transactions and do roughly the same thing from
> python, commit every 10,000 rows.

That's exactly what I'm doing, but with a commit count of 4,000.
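
For the archives, the loop is nothing fancy; roughly this shape (a
minimal sketch, not the real script: the table name, column layout,
file name and connect arguments are all made up, and real code has to
escape quotes in the values before building the INSERT string):

from pyPgSQL import PgSQL

COMMIT_EVERY = 4000                     # rows per transaction

db = PgSQL.connect(database="mydb")     # connect args from memory; adjust
cur = db.cursor()

f = open("big_file.csv")
n = 0
while 1:
    line = f.readline()
    if not line:
        break
    # naive split; assumes no quoted commas inside fields
    values = ", ".join(["'%s'" % v for v in line[:-1].split(",")])
    cur.execute("INSERT INTO my_table VALUES (%s)" % values)
    n = n + 1
    if n % COMMIT_EVERY == 0:
        db.commit()                     # keep each transaction small
db.commit()                             # commit whatever is left over
f.close()
db.close()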

> there is a variable in the config file that allows you to turn off flush
> to disk. If you do that for the load you will have better load
> speed. Turn it back on when you are done with the load.

This is presuming that one could bounce postmaster (not always
feasible). Also, as many have said, if An Accident Happens,
and postmaster terminates for whatever reason, you are left
with a corrupt database, and must reload _everything_ from
_every_ table. Blech...
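
(For the record, the knob being referred to lives in postgresql.conf;
if I remember the syntax right it is just

    fsync = false

and the postmaster has to be bounced, or at least signalled, before it
takes effect, which is exactly the problem above.)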

> and read the admin guide twice; it will help.

> >
> > ----- Original Message -----
> > From: "Ron Johnson" <ron(dot)l(dot)johnson(at)cox(dot)net>
> > To: "PgSQL Novice ML" <pgsql-novice(at)postgresql(dot)org>
> > Sent: Tuesday, May 21, 2002 3:40 PM
> > Subject: [NOVICE] Better way to bulk-load millions of CSV records into
> > postgres?
> >
> >
> > >
> > > Hi,
> > >
> > > Currently, I've got a python script using pyPgSQL that
> > > parses the CSV record, creates a string that is a big
> > > "INSERT INTO VALUES (...)" command, then execute()s it.
> > >
> > > top shows that this method uses postmaster with ~70% CPU
> > > utilization, and python with ~15% utilization.
> > >
> > > Still, it's only inserting ~190 recs/second. Is there a
> > > better way to do this, or am I constrained by the hardware?
> > >
> > > Instead of python and postmaster having to do a ton of data
> > > xfer over sockets, I'm wondering if there's a way to send a
> > > large number of csv records (4000, for example) in one big
> > > chunk to a stored procedure and have the engine process it
> > > all.
> > >
> > > Linux 2.4.18
> > > PostgreSQL 7.2.1
> > > python 2.1.3
> > > csv file on /dev/hda
> > > table on /dev/hde (ATA/100)

--
+---------------------------------------------------------+
| Ron Johnson, Jr. Home: ron(dot)l(dot)johnson(at)cox(dot)net |
| Jefferson, LA USA http://ronandheather.dhs.org:81 |
| |
| "I have created a government of whirled peas..." |
| Maharishi Mahesh Yogi, 12-May-2002, |
| CNN, Larry King Live |
+---------------------------------------------------------+
