Bulkloading using COPY - ignore duplicates?

From: Lee Kindness <lkindness(at)csl(dot)co(dot)uk>
To: pgsql-hackers(at)postgresql(dot)org
Cc: lkindness(at)csl(dot)co(dot)uk
Subject: Bulkloading using COPY - ignore duplicates?
Date: 2001-10-08 12:41:27
Message-ID: 15297.40695.914645.327132@elsick.csl.co.uk
Lists: pgsql-hackers

Guys,

I've made some inroads towards adding 'ignore duplicates'
functionality to PostgreSQL's COPY command. The parser grammar for
COPY FROM now accepts:

COPY [ BINARY ] table [ WITH OIDS ]
    FROM { 'filename' | stdin }
    [ [USING] DELIMITERS 'delimiter' ]
    [ WITH [NULL AS 'null string']
           [IGNORE DUPLICATES] ]

and added code to propagate this setting down to the CopyFrom function
in backend/commands/copy.c.
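
Roughly, the plumbing looks like this (a sketch, not the actual diff;
the field and argument names here are mine):

/* Sketch: the grammar sets a new flag on the CopyStmt parse node,
 * alongside the existing COPY options. */
typedef struct CopyStmt
{
    NodeTag     type;
    /* ... existing fields (relname, binary, oids, delimiter, ...) ... */
    bool        ignore_duplicates;  /* new: set by IGNORE DUPLICATES */
} CopyStmt;

/* DoCopy() then hands the flag down to CopyFrom(), which gains a
 * matching argument: */
static void CopyFrom(/* existing arguments, */ bool ignore_duplicates);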

I also played around with changing _bt_check_unique, _bt_doinsert and
btinsert to return NULL on a duplicate rather than calling
elog(ERROR). Likewise, ExecInsertIndexTuples and index_insert were
passed an ignore_duplicates flag, and index_insert was changed to call
elog(ERROR) only when the insert function returned NULL and
ignore_duplicates was false.
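
In code, that experiment looked something like this (argument lists
abbreviated; a sketch of the approach, not the final patch):

/* Sketch: btinsert() returns NULL on a duplicate instead of raising
 * the error itself; index_insert() re-raises the error unless the
 * caller asked for duplicates to be ignored. */
InsertIndexResult
index_insert(Relation indexRelation, /* ... */ bool ignore_duplicates)
{
    InsertIndexResult result = NULL;

    /* ... dispatch to the access method's insert routine (btinsert),
     * which assigns result, or NULL on a duplicate key ... */

    if (result == NULL && !ignore_duplicates)
        elog(ERROR, "Cannot insert a duplicate key into unique index %s",
             RelationGetRelationName(indexRelation));

    return result;
}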

These changes worked and gave the desired result for the COPY FROM
command. However, as many of you mentioned, these changes are far too
low-level... After assessing the situation more fully, I believe the
following change in CopyFrom would be more suitable:

/* BEFORE ROW INSERT Triggers */
if (resultRelInfo->ri_TrigDesc &&
    resultRelInfo->ri_TrigDesc->n_before_row[TRIGGER_EVENT_INSERT] > 0)
{
    HeapTuple   newtuple;

    newtuple = ExecBRInsertTriggers(estate, resultRelInfo, tuple);

    if (newtuple == NULL)       /* "do nothing" */
        skip_tuple = true;
    else if (newtuple != tuple) /* modified by Trigger(s) */
    {
        heap_freetuple(tuple);
        tuple = newtuple;
    }
}

/* new code (pseudocode) */
if (ignore_duplicates)
{
    if (duplicate index value)
        skip_tuple = true;
}

if (!skip_tuple)
{

Now I imagine 'duplicate index value' would be functionally similar to
_bt_check_unique, but obviously at a higher level. Is there any
existing code with the functionality I'm after? Can anyone point me in
the right direction?
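
To make the intent concrete, here's the sort of helper I have in mind
(entirely hypothetical; nothing like this exists in the tree, and the
index probe is exactly the part I'm asking about):

/* Hypothetical: return true if inserting 'tuple' would violate any
 * unique index on the result relation. */
static bool
tuple_is_duplicate(ResultRelInfo *resultRelInfo, HeapTuple tuple)
{
    int         i;

    for (i = 0; i < resultRelInfo->ri_NumIndices; i++)
    {
        /* skip non-unique indices */

        /* build the index key values from the heap tuple, much as
         * ExecInsertIndexTuples() does via FormIndexDatum() */

        /* probe the index for an equal entry -- functionally what
         * _bt_check_unique() does, but driven from outside */
    }
    return false;
}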

Thanks,

Lee Kindness.

Lee Kindness writes:
> I'm in the process of porting a large application from Ingres to
> PostgreSQL. We make heavy use of bulkloading using the 'COPY'
> statement in ESQL/C. Consider the SQL statements below (in a psql
> session on an arbitrary database):
>
> CREATE TABLE copytest(f1 INTEGER, f2 INTEGER);
> CREATE UNIQUE INDEX copytest_idx ON copytest USING BTREE(f1, f2);
> COPY copytest FROM '/tmp/copytest';
>
> Given the file /tmp/copytest:
>
> 1 1
> 2 2
> 3 3
> 4 4
> 4 4
> 5 5
> 6 6
>
> will result in the following output:
>
> ERROR: copy: line 5, Cannot insert a duplicate key into unique index copytest_idx
>
> However, my application code assumes that duplicate rows will simply
> be ignored (this is the case in Ingres, and I believe it is in
> Oracle's bulkloader too). I propose modifying _bt_check_unique() in
> src/backend/access/nbtree/nbtinsert.c to emit a NOTICE-level (rather
> than ERROR-level) elog() and return NULL (or something appropriate)
> to the calling function when a duplicate key is detected and a COPY
> FROM is in progress (adding a new parameter to flag this).
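>
> Roughly, the change would look like this (a sketch of the idea only;
> 'in_copy_from' stands for the proposed new parameter):
>
> /* in _bt_check_unique(), at the point the duplicate is detected */
> if (in_copy_from)
> {
>     elog(NOTICE, "Cannot insert a duplicate key into unique index %s",
>          RelationGetRelationName(rel));
>     return NULL;        /* caller skips this tuple */
> }
> elog(ERROR, "Cannot insert a duplicate key into unique index %s",
>      RelationGetRelationName(rel));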
>
> Does this seem a reasonable thing to do? Does anyone rely on COPY
> FROM raising an ERROR on duplicate input? Would:
>
> WITH ON_DUPLICATE = CONTINUE|TERMINATE (or similar)
>
> need to be added to the COPY command (I hope not)?
>
> Thanks,
>
> --
> Lee Kindness, Senior Software Engineer
> Concept Systems Limited.
