Re: COPY enhancements

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-08 18:43:51
Message-ID: 603c8f070910081143h2232509agd137f4023a4c6315@mail.gmail.com
Lists: pgsql-hackers

On Thu, Oct 8, 2009 at 1:26 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Thu, Oct 8, 2009 at 12:21 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Another approach that was discussed earlier was to divvy the rows into
>>> batches.  Say every thousand rows you sub-commit and start a new
>>> subtransaction.  Up to that point you save aside the good rows somewhere
>>> (maybe a tuplestore).  If you get a failure partway through a batch,
>>> you start a new subtransaction and re-insert the batch's rows up to the
>>> bad row.  This could be pretty awful in the worst case, but most of the
>>> time it'd probably perform well.  You could imagine dynamically adapting
>>> the batch size depending on how often errors occur ...
>
>> Yeah, I think that's promising.  There is of course the possibility
>> that a row which previously succeeded could fail the next time around,
>> but most of the time that shouldn't happen, and it should be possible
>> to code it so that it still behaves somewhat sanely if it does.
>
> Actually, my thought was that failure to reinsert a previously good
> tuple should cause us to abort the COPY altogether.  This is a
> cheap-and-easy way of avoiding sorcerer's apprentice syndrome.
> Suppose the failures are coming from something like out of disk space,
> transaction timeout, whatever ... a COPY that keeps on grinding no
> matter what is *not* ideal.

I think you handle that by putting a cap on the total number of errors
you're willing to accept (and in any event you always skip the row that
failed, so forward progress can't cease altogether). For out-of-disk-space
or transaction-timeout failures, sure, abort; but you might also hit
something like a serialization error on the reinsert that didn't occur on
the original insert. You don't want that to kill the whole bulk load, I
would think.
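
To make the control flow concrete, here's a rough sketch (in Python, purely
for illustration) of the batch/retry loop we're talking about.  The names
copy_skip_bad_rows, insert_row, batch_size, and max_errors are invented for
the example, the Python lists just stand in for starting and rolling back
real subtransactions, and an actual implementation would of course live in
the COPY code in C:

class CopyAborted(Exception):
    """The load should stop entirely (error cap hit, or systemic failure)."""


def copy_skip_bad_rows(rows, insert_row, batch_size=1000, max_errors=100):
    """Apply insert_row() to each row, batch_size rows per 'subtransaction',
    skipping rows whose insert raises.  Returns (loaded_rows, skipped_count)."""
    loaded, skipped = [], 0

    for start in range(0, len(rows), batch_size):
        good = []                        # rows of this batch known to be clean
        pending = rows[start:start + batch_size]
        while True:
            # --- begin a fresh subtransaction (simulated) ---
            # Replay rows that succeeded before an earlier abort; the rollback
            # threw their effects away.
            for row in good:
                try:
                    insert_row(row)
                except Exception:
                    # A previously good row failing again suggests something
                    # systemic (disk full, timeout): stop rather than grind on.
                    raise CopyAborted("previously good row failed on reinsert")
            failed_at = None
            for i, row in enumerate(pending):
                try:
                    insert_row(row)
                except Exception:
                    failed_at = i        # this aborts the subtransaction
                    break
            if failed_at is None:
                loaded.extend(good + pending)   # whole batch in: sub-commit
                break
            skipped += 1
            if skipped > max_errors:            # the cap on total errors
                raise CopyAborted("too many bad rows")
            good.extend(pending[:failed_at])    # re-insert these on retry
            pending = pending[failed_at + 1:]   # and skip the bad row itself

    return loaded, skipped


if __name__ == "__main__":
    # Toy "insert": int() rejects anything that isn't an integer.
    loaded, skipped = copy_skip_bad_rows(
        ["1", "2", "oops", "4", "5", "x", "7"], int, batch_size=3)
    print(loaded, skipped)    # ['1', '2', '4', '5', '7'] 2

The batch size in a sketch like this could also be adapted on the fly based
on how often errors turn up, per your earlier suggestion.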

...Robert
