Re: Bulkloading using COPY - ignore duplicates?

From: Lee Kindness <lkindness(at)csl(dot)co(dot)uk>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Lee Kindness <lkindness(at)csl(dot)co(dot)uk>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bulkloading using COPY - ignore duplicates?
Date: 2001-10-01 13:54:25
Message-ID: 15288.30097.74590.206271@elsick.csl.co.uk
Lists: pgsql-hackers

Tom Lane writes:
> Lee Kindness <lkindness(at)csl(dot)co(dot)uk> writes:
> > Would this seem a reasonable thing to do? Does anyone rely on COPY
> > FROM causing an ERROR on duplicate input?
> Yes. This change will not be acceptable unless it's made an optional
> (and not default, IMHO, though perhaps that's negotiable) feature of
> COPY.

I see where you're coming from, but seriously, what's the point of
COPY aborting and rolling back the whole load when a single duplicate
key is found? I think it's quite reasonable to presume the input to
COPY has had as little processing done on it as possible. I could
loop through the input file before sending it to COPY, but that's
just wasted cycles and effort - Postgres has btree lookup built in,
and I don't want to roll my own before giving Postgres my input file!
A sketch of the kind of workaround I mean is below.
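To be concrete, this is roughly what I'd have to bolt on today (the
table, column and file names are invented for illustration):

  CREATE TEMP TABLE readings_stage (id integer, val float8);

  -- the staging table has no unique index, so COPY can't fail here
  COPY readings_stage FROM '/tmp/readings.dat';

  -- weed out the duplicates by hand on the way across; rows whose
  -- key is already in "readings" are silently dropped
  INSERT INTO readings (id, val)
    SELECT s.id, s.val
      FROM readings_stage s
     WHERE NOT EXISTS (SELECT 1 FROM readings r WHERE r.id = s.id);

And even that doesn't cope with duplicates within the input file
itself - that needs yet another pass (SELECT DISTINCT ON, or similar).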

> The implementation might be rather messy too. I don't much care
> for the notion of a routine as low-level as bt_check_unique knowing
> that the context is or is not COPY. We might have to do some
> restructuring.

Well, in reality the check wouldn't be "you're being run from COPY"
but rather "raise a notice on a duplicate, rather than error and
abort". There is a telling comment in nbtinsert.c just before
_bt_check_unique() is called:

/*
* If we're not allowing duplicates, make sure the key isn't already
* in the index. XXX this belongs somewhere else, likely
*/

So perhaps duplicates should be checked for before _bt_doinsert is
called, or at some more appropriate point?

> > Would:
> > WITH ON_DUPLICATE = CONTINUE|TERMINATE (or similar)
> > need to be added to the COPY command (I hope not)?
> It occurs to me that skip-the-insert might be a useful option for
> INSERTs that detect a unique-key conflict, not only for COPY. (Cf.
> the regular discussions we see on whether to do INSERT first or
> UPDATE first when the key might already exist.) Maybe a SET variable
> that applies to all forms of insertion would be appropriate.

That makes quite a bit of sense.
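To make it concrete (the option syntax and the variable name below
are purely illustrative, not a proposal for the final spelling),
either of these would do the job for me:

  -- per-statement option on COPY
  COPY readings FROM '/tmp/readings.dat' WITH ON_DUPLICATE = CONTINUE;

  -- or a session-level setting that covers INSERT as well
  SET skip_duplicate_keys = on;           -- hypothetical variable name
  INSERT INTO readings VALUES (42, 3.14); -- skipped if id 42 already exists
  COPY readings FROM '/tmp/readings.dat';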

--
Lee Kindness, Senior Software Engineer
Concept Systems Limited.
