Re: Better way to bulk-load millions of CSV records into

From: Marc Spitzer <marc(at)oscar(dot)eng(dot)cv(dot)net>
To: PgSQL Novice ML <pgsql-novice(at)postgresql(dot)org>
Subject: Re: Better way to bulk-load millions of CSV records into
Date: 2002-05-22 18:11:22
Message-ID: 20020522141122.A38174@oscar.eng.cv.net
Lists: pgsql-novice

On Wed, May 22, 2002 at 12:48:58PM -0500, Ron Johnson wrote:
> On Wed, 2002-05-22 at 11:18, Marc Spitzer wrote:
> > On Wed, May 22, 2002 at 09:19:31AM -0500, Tom Sheehan wrote:
> > > Have you looked at the COPY command in psql for this? There are the 'FROM
> > > { 'filename' | stdin }' and '[ [USING] DELIMITERS ]' clauses that may be of use.
> > > It seems to me that bulk loading is what they were designed for.
> > >
> > > ts
> >
> > for very large datasets that can be a problem: it is 1 transaction and
> > the transaction logs must be kept until it finishes or aborts. This
> > can be a big disk hit.
> >
> > If it is just a plain csv file you can use split to make 1 file into
> > several smaller files and load each one separately.
>
> That's a possibility, but that would create _lots_ of little
> files... Much more convenient to have 1 big file, instead of
> 1 big file and many dozens of small files.

Well, you do not keep the small files, just create, load, delete. And I
routinely load 1 million+ rows of data using "copy ... from" in psql;
here is a sample:

for i in load_data/* ; do
    echo "datafile $i"
    # if field 15 contains a '.', replace it with -1
    awk -F, 'BEGIN{OFS=","}{if ($15 ~ /[.]/) {$15="-1"} print $0}' $i > $i.tmp
    mv $i.tmp $i
    # show any remaining rows with dotted numbers in the log, then drop them from the file
    grep -E "[0-9]+([.][0-9]+)+" $i
    grep -vE "[0-9]+([.][0-9]+)+" $i > $i.tmp
    mv $i.tmp $i
    # emit the COPY statement for this file
    echo "copy call_me_bob from '/home/marc/projects/bobs_house/$i' using Delimiters ',' with null $
done
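
If you start from one big file instead, the create/load/delete variant I
mentioned above would look roughly like this. The 100k chunk size, the chunk
file names, and the "with null as ''" ending on the COPY are guesses on my
part, not the exact production job:

# sketch: chop one big CSV into 100k-row chunks, load each chunk, then toss it
split -l 100000 big_file.csv load_data/chunk.
for i in load_data/chunk.* ; do
    echo "loading $i"
    echo "copy call_me_bob from '/home/marc/projects/bobs_house/$i' using Delimiters ',' with null as ''" \
        | /usr/local/bin/psql call_me_bob
    rm $i
done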

This is part of a production job doing 1 million+ rows/day on 7.1.3 in
under 1 hour. After everything is done, I clean up like this:

cat <<EOT | /usr/local/bin/psql call_me_bob

-- drop samples more than 3 days old
delete from ubr_info
where sampletime < current_date - '3 days'::interval;

vacuum analyze;
-- I think that the 7.2.x equivalent would be "vacuum full analyze", or close to it
reindex table ubr_info;
EOT
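
For what it's worth, on 7.2 I believe that cleanup would read something like
this; I have not run it myself, since this box is still on 7.1.3:

cat <<EOT | /usr/local/bin/psql call_me_bob

delete from ubr_info
where sampletime < current_date - '3 days'::interval;

-- in 7.2 a plain vacuum no longer compacts the table; vacuum full does
vacuum full analyze;
reindex table ubr_info;
EOT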

Reindexing also helps a lot, since I delete a lot of rows daily.

This is running on a PII 450 with IDE RAID, OS FreeBSD 4.4-RC; I really
need to update that.
>
> > you can look at transactions and do roughly the same thing from
> > python, commit every 10,000 rows.
>
> That's exactly what I'm doing, but with commit count of 4,000.
>
> > there is a variable in the config file that allows you to turn off flush
> > to disk. If you do that for the load you will have better load
> > speed. Turn it back on when you are done with the load.
>
> This is presuming that one could bounce postmaster (not always
> feasible). Also, as many have said, if An Accident Happens,
> and postmaster terminates for whatever reason, you are left
> with a corrupt database, and must reload _everything_ from
> _every_ table. Blech...
>
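
The variable I was thinking of is fsync in postgresql.conf. A rough sketch of
the dance, assuming you can take the restarts and the risk you mention above
(adjust $PGDATA and the pg_ctl path for your install):

# turn fsync off for the load window
perl -pi -e 's/^#?\s*fsync\s*=.*/fsync = false/' $PGDATA/postgresql.conf
pg_ctl restart -D $PGDATA

# ... run the bulk load here ...

# and turn it back on afterwards
perl -pi -e 's/^fsync\s*=.*/fsync = true/' $PGDATA/postgresql.conf
pg_ctl restart -D $PGDATA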

Well, you could always do it in C; Python is slow, from what I hear.

good luck

marc

> > and read the admin guide twice, it will help.
