Re: Bulkloading using COPY - ignore duplicates?

From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Lee Kindness <lkindness(at)csl(dot)co(dot)uk>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Jim Buttafuoco <jim(at)buttafuoco(dot)net>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bulkloading using COPY - ignore duplicates?
Date: 2002-01-02 21:09:36
Message-ID: 200201022109.g02L9aW27520@candle.pha.pa.us
Lists: pgsql-hackers

Lee Kindness wrote:
> Tom Lane writes:
> > Lee Kindness <lkindness(at)csl(dot)co(dot)uk> writes:
> > > In an ideal world 'COPY FROM' would only be used with data output by
> > > 'COPY TO' and it would be nice and sanitised. However in some fields
> > > this often is not a possibility due to performance constraints!
> > Of course, the more bells and whistles we add to COPY, the slower it
> > will get, which rather defeats the purpose no?
>
> Indeed, but as I've mentioned in this thread in the past, the code
> path for COPY FROM already does a check against the unique index (if
> there is one) but bombs-out rather than handling it...
>
> It wouldn't add any execution time if there were no duplicates in the
> input!

I know many purists object to letting COPY discard invalid input rows,
but we get lots of requests for this feature, and there are few
workarounds other than pre-processing the flat file. Of course, if they
use INSERT, they will get errors that they can just ignore. I don't see
how allowing errors in COPY is any more illegal, except that COPY is one
command while multiple INSERTs are separate commands.
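
For reference, the pre-processing workaround might look something like
the sketch below. It is only an illustration, not anything from the
backend: the function name dedupe_flat_file, the tab delimiter, and the
choice of the first column as the unique key are all assumptions.

```python
def dedupe_flat_file(lines, key_columns=(0,), delimiter="\t"):
    """Keep the first row seen for each key and drop later duplicates.

    Hypothetical helper: key_columns and delimiter are assumptions about
    the file layout, not something COPY itself defines here.
    Returns (kept_lines, dropped_line_numbers), line numbers 1-based.
    """
    seen = set()      # keys already emitted
    kept = []         # lines that would be passed on to COPY
    dropped = []      # 1-based line numbers of discarded duplicates
    for line_no, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split(delimiter)
        key = tuple(fields[i] for i in key_columns)
        if key in seen:
            dropped.append(line_no)
        else:
            seen.add(key)
            kept.append(line)
    return kept, dropped


if __name__ == "__main__":
    sample = ["1\ta\n", "2\tb\n", "1\tc\n"]
    kept, dropped = dedupe_flat_file(sample)
    print("kept:", kept)
    print("dropped line numbers:", dropped)
```

Note that this only catches duplicates within the file itself, not rows
that collide with data already in the table, and it holds every key in
memory, which is part of why people ask for this in COPY.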

Seems we need to allow such a capability, if only crudely. I don't
think we can create a discard file because of the problem with remote
COPY.

I think we can allow something like:

COPY FROM '/tmp/x' WITH ERRORS 2

meaning we will allow at most two errors and will report the error line
numbers to the user. I think this syntax clearly indicates that errors
are being accepted in the input. An alternate syntax would allow an
unlimited number of errors:

COPY FROM '/tmp/x' WITH ERRORS

The errors can be non-unique errors, or even CHECK constraint errors.
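
To make the intended behaviour concrete, here is a rough simulation of
the semantics in Python. It does not touch the backend at all, and every
name in it (load_with_errors, TooManyErrors, the key and check
callables) is hypothetical. It also assumes that the whole load is
aborted once the limit is exceeded, which the proposal above does not
spell out.

```python
class TooManyErrors(Exception):
    """Raised when more rows are rejected than the caller allows."""


def load_with_errors(rows, key, check=None, max_errors=None):
    """Simulate loading rows while tolerating a bounded number of bad rows.

    key        -- function returning the unique-index key for a row
    check      -- optional function; False means a CHECK-style violation
    max_errors -- None means unlimited (the bare WITH ERRORS form)
    Returns (loaded_rows, errors), errors being (line_number, reason).
    Assumption: exceeding max_errors aborts the load with an exception.
    """
    seen = set()
    loaded = []
    errors = []
    for line_no, row in enumerate(rows, start=1):
        if check is not None and not check(row):
            reason = "check constraint"
        elif key(row) in seen:
            reason = "duplicate key"
        else:
            seen.add(key(row))
            loaded.append(row)
            continue
        # The row was rejected: remember its line number for the report.
        errors.append((line_no, reason))
        if max_errors is not None and len(errors) > max_errors:
            raise TooManyErrors(
                "more than %d errors, last at line %d" % (max_errors, line_no)
            )
    return loaded, errors


if __name__ == "__main__":
    rows = [(1, 5), (1, 6), (2, -1), (3, 7)]
    loaded, errors = load_with_errors(
        rows, key=lambda r: r[0], check=lambda r: r[1] >= 0, max_errors=2
    )
    print("loaded:", loaded)
    for line_no, reason in errors:
        print("line %d rejected: %s" % (line_no, reason))
```

With max_errors=2 the sample loads two rows and reports lines 2 and 3;
with max_errors=1 the same input raises TooManyErrors instead.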

Unless I hear complaints, I will add it to TODO.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
