Re: COPY enhancements

From: Rod Taylor <rod(dot)taylor(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-08 16:12:40
Message-ID: 751261b20910080912md3176e8ga671c768b4fee055@mail.gmail.com
Lists: pgsql-hackers

> Yeah. I think it's going to be hard to make this work without having
> standalone transactions. One idea would be to start a subtransaction,
> insert tuples until one fails, then rollback the subtransaction and
> start a new one, and continue on until the error limit is reached.
>

I've found performance is reasonable for data with a low error rate
(say 1 error per 100,000 records or fewer) by doing the following:

SAVEPOINT bulk;
Insert 1000 records using COPY.

If there is an error, roll back to bulk and step through each line of that
batch individually, each within its own subtransaction. All good lines are
kept and bad lines are logged; client-side control makes logging trivial.

The next set of 1000 records is done in bulk again.

1000 records per savepoint seems to be a good balance for my data: not too
much time lost to subtransaction overhead, and not too many records to retry
when a single record fails. And since the loop is controlled by the client
side rather than the server side, reporting broken records back is trivial.

It may be possible to boost performance by:

1) Having COPY remember which specific line caused the error. The client can
then rerun lines 1 through 487 in one subtransaction, since it knows those are
successful, run line 488 in its own subtransaction, and run lines 489 through
... in a new subtransaction.
2) Increasing the number of records per subtransaction if the data is clean.
It wouldn't take long before you were inserting millions of records per
subtransaction for a large data set, which should make the subtransaction
overhead minimal. Small imports would still run slower, but very large imports
of clean data should end up at essentially the same speed (a rough sketch
follows this list).
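
A sketch of what (2) could look like on the client side, reusing the
hypothetical copy_batch() from above; the growth factor and limits are
arbitrary:

MIN_BATCH = 1_000
MAX_BATCH = 1_000_000

def load_adaptive(conn, lines):
    """Grow the batch size while batches stay clean, so a large, clean
    import converges on very few subtransactions."""
    rejected = []
    size = MIN_BATCH
    batch = []
    with conn.cursor() as cur:
        for line in lines:
            batch.append(line)
            if len(batch) >= size:
                bad = copy_batch(cur, batch)
                rejected.extend(bad)
                # double on a clean batch, shrink back after any failure
                size = MIN_BATCH if bad else min(size * 2, MAX_BATCH)
                batch = []
        if batch:
            rejected.extend(copy_batch(cur, batch))
    conn.commit()
    return rejected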
