Re: optimizing import of large CSV file into partitioned table?

From: Thom Brown <thombrown(at)gmail(dot)com>
To: Rick Casey <caseyrick(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: optimizing import of large CSV file into partitioned table?
Date: 2010-03-28 19:03:42
Message-ID: bddc86151003281203l14e9d586ha5f20f226891884d@mail.gmail.com
Lists: pgsql-general

On 28 March 2010 18:33, Rick Casey <caseyrick(at)gmail(dot)com> wrote:

> After careful research, I would like to post the following problem I'm having
> with the importing of a large (16Gb) CSV file. Here is a brief synopsis:
> - this is running on Postgres (PG) version: PostgreSQL 8.3.9 on
> i486-pc-linux-gnu, compiled by GCC gcc-4.3.real (Ubuntu 4.3.2-1ubuntu11)
> 4.3.2
> - it is running on an Ubuntu (small) server instance at Amazon Web Services
> (AWS), with a 320Gb volume mounted for the PG data directory
> - the database was created using the partition example in the
> documentation, with an insert trigger and a function to direct which table
> records get inserted into.
> (see below for code on my table and trigger creation)
>
> After some days of attempting to import the full 16Gb CSV file, I decided
> to split the thing up, using the split utility in Linux. This seemed to
> improve things; once I had split the CSV files into about 10Mb size files, I
> finally got my first successful import of about 257,000 recs. However, this
> is going to be a rather labor-intensive process to import the full 16Gb
> file if I have to manually split it up and import each smaller file
> separately.
>
> So, I am wondering if there is any way to optimize this process? I have been
> using Postgres for several years, but have never had to partition or
> optimize it for files of this size until now.
> Any comments or suggestions from this excellent forum would be most
> welcome.
>
> (I might add that I spent several weeks prior to this trying to get this to
> work in MySQL, which I finally had to abandon.)
>
> Sincerely,
> Rick
>
> Details of the code follow:
>
> Here is the basic COPY command, which I run as the postgres user, to import
> the CSV files:
> <begin>
> COPY allcalls FROM '/data/allcalls-csv/sub3ab' WITH CSV;
>
> Here is what some sample data look like in the files:
> 3153371867,2008-02-04 16:11:00,1009,1,40
> 2125673062,2008-02-04 16:11:00,1009,1,41
> 5183562377,2008-02-04 16:11:00,1009,1,50
> ...
>
> Here are the basic scripts that created the partition table and insert
> trigger:
> CREATE TABLE allcalls (
>     phonenum bigint,
>     callstarted timestamp without time zone,
>     status int,
>     attempts int,
>     duration int
> );
> CREATE TABLE allcalls_0 (
>     CHECK ( phonenum < 1000000000 )
> ) INHERITS (allcalls);
> ...(repeat this 9 more times, for 10 subpartition tables)
>
> CREATE INDEX allcalls_0_phonenum ON allcalls_0 (phonenum);
> ..(repeat this 9 more times, for indexes on the 10 subpartition tables)
> CREATE OR REPLACE FUNCTION allcalls_insert_trigger()
> RETURNS TRIGGER AS $$
> BEGIN
>     IF ( NEW.phonenum < 1000000000 ) THEN
>         INSERT INTO allcalls_0 VALUES (NEW.*);
>     ELSIF ( NEW.phonenum >= 1000000000 AND NEW.phonenum < 2000000000 ) THEN
>         INSERT INTO allcalls_1 VALUES (NEW.*);
> ...(again, repeat for rest of the partition tables)
>
> CREATE TRIGGER insert_phonenum_trigger
> BEFORE INSERT ON allcalls
> FOR EACH ROW EXECUTE PROCEDURE allcalls_insert_trigger();
>
> <end>
>

The problem here is that every single row requires an index update, a trigger
firing and a constraint check. The first thing I'd suggest is removing the
indexes and creating them after the import; otherwise the index has to be
updated for every single entry. The trigger won't help either: import into a
single, plain table and split the data out into the child tables afterwards
if required. Finally, the constraints should probably be applied afterwards
too, so cull any violating rows once the import is done.
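
For example, something along these lines (the staging table is just a copy of
your allcalls column list, the file path is taken from your example, and you'd
repeat the INSERT and CREATE INDEX statements for each child table):

CREATE TABLE allcalls_staging (
    phonenum bigint,
    callstarted timestamp without time zone,
    status int,
    attempts int,
    duration int
);

-- Bulk load with no indexes, triggers or constraints in the way
COPY allcalls_staging FROM '/data/allcalls-csv/sub3ab' WITH CSV;

-- Split the data out into the child tables in bulk, one range per statement
INSERT INTO allcalls_0
    SELECT * FROM allcalls_staging WHERE phonenum < 1000000000;
INSERT INTO allcalls_1
    SELECT * FROM allcalls_staging
    WHERE phonenum >= 1000000000 AND phonenum < 2000000000;
-- ...and so on for the remaining child tables.
-- Anything not covered by these ranges stays in the staging table,
-- where it can be inspected or discarded.

-- Only now build the indexes, once per child table
CREATE INDEX allcalls_0_phonenum ON allcalls_0 (phonenum);
CREATE INDEX allcalls_1_phonenum ON allcalls_1 (phonenum);
-- ...and so on

DROP TABLE allcalls_staging;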

Thom
