From: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To: "John T(dot) Dow" <john(at)johntdow(dot)com>
Cc: "pgsql-jdbc(at)postgresql(dot)org" <pgsql-jdbc(at)postgresql(dot)org>
Subject: Re: Populating large tables with occasional bad values
Date: 2008-06-11 17:22:14
Message-ID: 485009C6.3080802@postnewspapers.com.au
Lists: pgsql-jdbc

John T. Dow wrote:

> If I did this regularly, three options seem easiest.
>
> 1 - Load locally to get clean data and then COPY. This requires the server to have local access to the file to be copied, and if the server is hosted by an ISP, it depends on them whether you can do this easily.

You can COPY over a PostgreSQL network connection. See the \copy support
in psql for one example of how it works.
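
For example, from psql on the client machine it's just (table and file
names invented here):

\copy mytable from 'cleaned_data.csv' with csv

which psql turns into a COPY ... FROM STDIN on the server and streams
the local file over the existing network connection.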

I've never had cause to try to use it with JDBC so I don't know how/if
it works in that context.

However, batched inserts can be quite fast enough. If you do one INSERT
statement per 1000 rows you'll already see a massive performance boost:

INSERT INTO mytable
VALUES
(blah, blah),
(blah, blah),
(blah, blah),
-- hundreds of rows later
(blah, blah);

... will be a *LOT* faster. If you have to do special processing or
error handling you can do it once you have the data in a server-side
staging table - and you can get it there quickly with multi-row inserts
or (at least using psql) a \copy.

> 3 - If the only problem is duplicate keys, load into a special table without the constraint, issue update commands to rewrite the keys as needed, then select/insert to the correct table.

This would be my preferred approach, personally, using either network
COPY or multi-row INSERTs to get the data into the staging table.
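
As a rough sketch of that approach (table and column names invented,
and shifting the colliding keys past the current maximum is just one
possible policy):

CREATE TABLE import_staging (LIKE mytable); -- unique constraints not copied

-- ... COPY or multi-row INSERT the raw data into import_staging ...

UPDATE import_staging
   SET id = id + (SELECT max(id) FROM mytable)
 WHERE EXISTS (SELECT 1 FROM mytable m
               WHERE m.id = import_staging.id);

INSERT INTO mytable SELECT * FROM import_staging;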

You can do a whole lot more than fix unique constraint violations,
though. With a PL/PgSQL function to process the staging table and do the
inserts you can do some pretty major data surgery.
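
Very roughly, something along these lines (made-up table and column
names, just to show the shape of it):

CREATE OR REPLACE FUNCTION process_staging() RETURNS void AS $$
DECLARE
    r RECORD;
BEGIN
    FOR r IN SELECT * FROM import_staging LOOP
        -- arbitrary per-row cleanup and restructuring goes here
        INSERT INTO mytable (id, name, phone)
        VALUES (r.id, trim(r.name), nullif(r.phone, ''));
    END LOOP;
END;
$$ LANGUAGE plpgsql;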

I have some conversion code that takes pretty ugly data in unconstrained
staging tables and reprocesses it to fit a new, much stricter and better
normalized schema. Large amounts of restructuring and fixing up are
required. Some of the simpler conversions are done by INSERT ... SELECT
statements, but the complicated stuff is done with PL/PgSQL functions.

I've sent you the conversion code off-list in case it's informative.
It's pretty ugly code as it's going to be thrown away when we cut over
to the new system, but it should show just how much data conversion &
repair you can do on the database server side.

You can also put aside rows that need user interaction as you
process the staging table. Just INSERT them into a "problem"
table and delete them from the staging table. Then have the client scan
through the (much smaller) problem table and ask the user to make
decisions. When everything looks satisfactory and the user's decisions
have been acted on, COMMIT.
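
Just as a sketch again (names invented; problem_rows created the same
way as the staging table):

BEGIN;
INSERT INTO problem_rows
    SELECT s.* FROM import_staging s
     WHERE EXISTS (SELECT 1 FROM mytable m WHERE m.id = s.id);
DELETE FROM import_staging
 WHERE EXISTS (SELECT 1 FROM mytable m
               WHERE m.id = import_staging.id);
-- ... client walks the user through problem_rows and applies fixes ...
COMMIT;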

--
Craig Ringer
