Re: Populating large tables with occasional bad values

From: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To: "John T(dot) Dow" <john(at)johntdow(dot)com>
Cc: "pgsql-jdbc(at)postgresql(dot)org" <pgsql-jdbc(at)postgresql(dot)org>
Subject: Re: Populating large tables with occasional bad values
Date: 2008-06-11 17:22:14
Message-ID: 485009C6.3080802@postnewspapers.com.au
Lists: pgsql-jdbc
John T. Dow wrote:

> If I did this regularly, three options seem easiest.
> 
> 1 - Load locally to get clean data and then COPY. This requires the server to have local access to the file to be copied, and if the server is hosted by an ISP, it depends on them whether you can do this easily.

You can COPY over a PostgreSQL network connection. See the \copy support 
in psql for one example of how it works.

I've never had cause to try to use it with JDBC so I don't know how/if 
it works in that context.
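
Something like this in psql would do it, though; the staging table and file 
names below are just placeholders. \copy reads the file on the client and 
streams it over the ordinary network connection, so the server never needs 
filesystem access to the file:

\copy staging_mytable FROM 'mydata.csv' WITH CSV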

However, batched inserts can be quite fast enough. If you do one multi-row 
INSERT per 1000 rows you'll already be seeing a massive performance boost:

INSERT INTO mytable
VALUES
(blah, blah),
(blah, blah),
(blah, blah),
-- hundreds of rows later
(blah, blah);

... will be a *LOT* faster. If you have to do special processing or 
error handling you can do it once you have the data in a server-side 
staging table - and you can get it there quickly with multi-row inserts 
or (at least using psql) a \copy.

> 3 - If the only problem is duplicate keys, load into a special table without the constraint, issue update commands to rewrite the keys as needed, then select/insert to the correct table.

This would be my preferred approach, personally, using either network 
COPY or multi-row INSERTs to get the data into the staging table.
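
As a rough sketch of that approach - the table, column and sequence names 
here are invented, and it assumes the staging table has the same column 
layout as the real one, with the key normally fed by a sequence:

-- Rewrite any keys that collide with rows already in the real table
UPDATE staging_mytable
   SET id = nextval('mytable_id_seq')
 WHERE EXISTS (SELECT 1 FROM mytable WHERE mytable.id = staging_mytable.id);

-- Then move everything across
INSERT INTO mytable
SELECT * FROM staging_mytable;

That doesn't deal with duplicates within the load itself; those would need a 
similar pass over the staging table first.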

You can do a whole lot more than fix unique constraint violations, 
though. With a PL/PgSQL function to process the staging table and do the 
inserts, you can do some pretty major data surgery.
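
The general shape is something like this - just a minimal sketch, with 
invented table and column names and a made-up bit of cleanup logic:

CREATE OR REPLACE FUNCTION process_staging() RETURNS void AS $$
DECLARE
    r staging_mytable%ROWTYPE;
BEGIN
    FOR r IN SELECT * FROM staging_mytable LOOP
        -- whatever "surgery" the data needs: trim junk whitespace,
        -- map legacy codes onto the new scheme, split fields, etc.
        INSERT INTO mytable (id, name, status)
        VALUES (r.id, trim(r.name),
                CASE r.status WHEN 'A' THEN 'active' ELSE 'inactive' END);
    END LOOP;
END;
$$ LANGUAGE plpgsql;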

I have some conversion code that takes pretty ugly data in unconstrained 
staging tables and reprocesses it to fit a new, much stricter and better 
normalized schema. Large amounts of restructuring and fixing up are 
required. Some of the simpler conversions are done by INSERT ... SELECT 
statements, but the complicated stuff is done with PL/PgSQL functions.

I've sent you the conversion code off-list in case it's informative. 
It's pretty ugly code as it's going to be thrown away when we cut over 
to the new system, but it should show just how much data conversion & 
repair you can do on the database server side.

You can also set aside rows that need user interaction as you 
process the staging table. Just INSERT them into a "problem" 
table and delete them from the staging table. Then have the client scan 
through the (much smaller) problem table and ask the user to make 
decisions. When everything looks satisfactory and the user's decisions 
have been acted on, COMMIT.
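
In plain SQL that step can be as simple as the following, using the 
duplicate-key case as the example (table and key names are made up):

-- Set aside staging rows whose keys already exist in the real table
INSERT INTO problem_mytable
SELECT * FROM staging_mytable
 WHERE EXISTS (SELECT 1 FROM mytable
               WHERE mytable.id = staging_mytable.id);

DELETE FROM staging_mytable
 WHERE EXISTS (SELECT 1 FROM problem_mytable
               WHERE problem_mytable.id = staging_mytable.id);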

--
Craig Ringer
