Re: moving CSV data into table?

From: Steve Crawford <scrawford(at)pinpointresearch(dot)com>
To: James Hartley <jjhartley(at)gmail(dot)com>
Cc: pgsql-novice(at)postgresql(dot)org
Subject: Re: moving CSV data into table?
Date: 2011-09-02 18:16:15
Message-ID: 4E611D6F.20708@pinpointresearch.com
Lists: pgsql-novice

On 09/02/2011 10:08 AM, James Hartley wrote:
> I need to ask about best practices. I now have periodic bulk CSV data
> needing to be added to an existing table. I assume it is best to
> first move the data into a temporary table & scrub it before moving
> into the final table. However, the incoming data does not have the
> same schema as the permanent target as additional attributes (columns)
> have been added. Is it better to massage the temporary table into the
> same target's schema first, & use COPY to copy all temporary contents
> into the permanent table, or is scripting a better solution? Is there
> a better method?
>
> Your insights would be appreciated.
>
> Thanks.
As always, the answer is "it depends". Is performance a concern? What
type of scrubbing must be done? Does scrubbing require comparison to
existing database data? Is lights-out automation required? A combination
of the above?

You may find it preferable to do pre-scrubbing (sed/grep/awk type of
stuff that doesn't require access to existing data in the database)
prior to initial import.
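As a minimal sketch of that kind of pre-scrubbing (file names are made
up for illustration), this strips DOS line endings and drops blank
lines before the file ever reaches COPY:

```shell
# Create a sample CSV with CRLF line endings and a stray blank line
# (stands in for the real incoming file).
printf 'a,b,c\r\n\r\n1,2,3\r\n' > incoming.csv

# Strip trailing carriage returns, then drop blank lines (awk 'NF'
# only prints lines with at least one field).
sed -e 's/\r$//' incoming.csv | awk 'NF' > scrubbed.csv

cat scrubbed.csv
```

Anything that can be done with line-oriented tools like this is usually
cheaper to do before the data is in the database.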

I would typically create a temporary table that matches the structure of
the data you are bringing in then craft a query or set of queries to
scrub and import the data.
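A sketch of that setup, with table, column, and file names invented for
illustration (adjust types and the COPY options to your actual file):

```sql
-- Temporary table matching the incoming CSV's layout, not the
-- permanent table's.
CREATE TEMPORARY TABLE import_raw (
    a integer,
    b integer,
    c text,
    d integer,
    e integer
);

-- Server-side COPY; use \copy from psql if the file lives on the client.
COPY import_raw FROM '/path/to/incoming.csv' WITH (FORMAT csv, HEADER true);

-- Scrub-and-import queries run against import_raw from here on.
```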

Notes:

You don't need to alter the temporary table structure to copy data -
just select the data you want. Suppose your temporary import table has
five columns a, b, c, d and e, but your permanent table has only a, b
and c. The import is basically:
insert into permtable (a, b, c) select a, b, c from temptable;

The select can be as complicated as you want. You can eliminate
duplicates, perform calculations, add static data like a batch number,
etc. For example, the following imports "a" unchanged, puts b+d from
the temp table into b in the permanent table, imports "c" unchanged and
puts the integer 1 into batchnum. It also skips records in the import
table where e is zero, and records whose a would duplicate an existing
a in the permanent table:
insert into permtable (a, b, c, batchnum)
select a, b+d, c, 1::int
from temptable t
where e != 0
and not exists (select 1 from permtable p where p.a = t.a);

Also note that for performance, the temporary table should really be
temporary (i.e. create temporary table foo...). Temporary tables are
"unlogged", visible only to the current connection and deleted when the
connection closes. Since they are unlogged, you eliminate the
performance cost of maintaining the write-ahead logs. (One upcoming
feature is that you can specify that a permanent table is unlogged -
useful for certain tables where data loss is not a problem.)
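For reference, that feature (UNLOGGED tables, introduced in PostgreSQL
9.1) looks like this; the table name here is invented:

```sql
-- Permanent but unlogged: survives the session, skips WAL, but is
-- truncated after a crash - only suitable where data loss is acceptable.
CREATE UNLOGGED TABLE staging_area (
    a integer,
    b integer,
    c text
);
```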

Cheers,
Steve
