Re: Bulkloading using COPY - ignore duplicates?

From: Thomas Swan <tswan(at)olemiss(dot)edu>
To: Lee Kindness <lkindness(at)csl(dot)co(dot)uk>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bulkloading using COPY - ignore duplicates?
Date: 2001-10-01 14:21:22
Message-ID: 3BB87BE2.3000406@olemiss.edu
Lists: pgsql-hackers

Lee Kindness wrote:

>Tom Lane writes:
> > Lee Kindness <lkindness(at)csl(dot)co(dot)uk> writes:
> > > Would this seem a reasonable thing to do? Does anyone rely on COPY
> > > FROM causing an ERROR on duplicate input?
> > Yes. This change will not be acceptable unless it's made an optional
> > (and not default, IMHO, though perhaps that's negotiable) feature of
> > COPY.
>
>I see where you're coming from, but seriously what's the use/point of
>COPY aborting and doing a rollback if one duplicate key is found? I
>think it's quite reasonable to presume the input to COPY has had as
>little processing done on it as possible. I could loop through the
>input file before sending it to COPY but that's just wasting cycles
>and effort - Postgres has btree lookup built in, I don't want to roll
>my own before giving Postgres my input file!
>
> > The implementation might be rather messy too. I don't much care
> > for the notion of a routine as low-level as bt_check_unique knowing
> > that the context is or is not COPY. We might have to do some
> > restructuring.
>
>Well in reality it wouldn't be "you're getting run from copy" but
>rather "notice on duplicate, rather than error & exit". There is a
>telling comment in nbtinsert.c just before _bt_check_unique() is
>called:
>
> /*
> * If we're not allowing duplicates, make sure the key isn't already
> * in the index. XXX this belongs somewhere else, likely
> */
>
>So perhaps dupes should be searched for before _bt_doinsert is called,
>or somewhere more appropriate?
>
> > > Would:
> > > WITH ON_DUPLICATE = CONTINUE|TERMINATE (or similar)
> > > need to be added to the COPY command (I hope not)?
> > It occurs to me that skip-the-insert might be a useful option for
> > INSERTs that detect a unique-key conflict, not only for COPY. (Cf.
> > the regular discussions we see on whether to do INSERT first or
> > UPDATE first when the key might already exist.) Maybe a SET variable
> > that applies to all forms of insertion would be appropriate.
>
>That makes quite a bit of sense.
>
This is just trying to avoid one extra step.

IMHO, you should COPY into a temporary table and then do a SELECT
DISTINCT from it into the table you actually want (see the sketch below).

A. You can validate your data before you put it into your permanent table.
B. This doesn't cost you much.
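
Roughly, something like this (the parts table, its columns, and the file
path are just made-up examples; I've used DISTINCT ON and NOT EXISTS so
that keys already in the target table get skipped too, which is a bit
more than a plain SELECT DISTINCT):

    -- Hypothetical target table with a unique key.
    CREATE TABLE parts (
        id   integer PRIMARY KEY,
        name text
    );

    -- Stage the raw file in a constraint-free temporary table.
    CREATE TEMP TABLE parts_load (
        id   integer,
        name text
    );
    COPY parts_load FROM '/tmp/parts.dat';

    -- Keep one row per key and skip keys already in the target.
    INSERT INTO parts (id, name)
    SELECT DISTINCT ON (id) id, name
    FROM parts_load pl
    WHERE NOT EXISTS (SELECT 1 FROM parts p WHERE p.id = pl.id);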

Don't assume that bulk copies haven't been checked or validated. COPY
should be fed correct data; if you can't guarantee that, you shouldn't
be using COPY.

