Re: Bulkloading using COPY - ignore duplicates?

From: Lee Kindness <lkindness(at)csl(dot)co(dot)uk>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Lee Kindness <lkindness(at)csl(dot)co(dot)uk>, "Jim Buttafuoco" <jim(at)buttafuoco(dot)net>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bulkloading using COPY - ignore duplicates?
Date: 2001-10-03 10:01:30
Message-ID: 15290.57850.68004.391312@elsick.csl.co.uk
Lists: pgsql-hackers

Peter Eisentraut writes:
> However, it seems to me that COPY ignoring duplicates can easily be
> done by preprocessing the input file.

Or by post-processing, like (error checking cut):

void import_shots(char *impfile, int lineshoot_id)
{
    char tab_name[128];
    char tab_temp[128];
    EXEC SQL BEGIN DECLARE SECTION;
    char cmd[256];
    EXEC SQL END DECLARE SECTION;

    frig_file(impfile); /* add the postgres header */
    sprintf(tab_name, "shot_%d", lineshoot_id);
    sprintf(tab_temp, "shot_%d_tmp", lineshoot_id);

    /* clone the shot table into a scratch table for the raw load */
    sprintf(cmd, "CREATE TEMPORARY TABLE %s AS SELECT * FROM shot",
            tab_temp);
    EXEC SQL EXECUTE IMMEDIATE :cmd;
    EXEC SQL COMMIT WORK; /* will not work without commit here! */

    /* bulk-load the import file into the scratch table */
    sprintf(cmd, "COPY BINARY %s FROM '%s'", tab_temp, impfile);
    append_page_alloc(cmd, tab_name, impfile, 1);
    EXEC SQL EXECUTE IMMEDIATE :cmd;

    /* copy across to the real table, dropping duplicate shot_times */
    sprintf(cmd, "INSERT INTO %s SELECT DISTINCT ON (shot_time) * FROM %s",
            tab_name, tab_temp);
    EXEC SQL EXECUTE IMMEDIATE :cmd;

    sprintf(cmd, "DROP TABLE %s", tab_temp);
    EXEC SQL EXECUTE IMMEDIATE :cmd;

    EXEC SQL COMMIT WORK;
    remove(impfile);
}

However, this adds significant time to the import operation. Likewise,
I could loop over the input file first and hunt for duplicates before
the COPY, again with a performance hit.
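
For what it's worth, that pre-processing pass needn't be complicated.
Something like the sketch below would do it (a hypothetical helper, not
code I'm running; it assumes a text-format dump sorted on shot_time with
the key in the first tab-separated field, so duplicates are adjacent.
Unsorted or binary input would need a hash table and a format-aware
reader instead):

#include <stdio.h>
#include <string.h>

/* Sketch only: drop lines whose first tab-separated field repeats the
 * previous line's. Assumes the file is sorted on that field. */
int dedupe_file(const char *infile, const char *outfile)
{
    char line[1024], key[256], prev_key[256] = "";
    FILE *in = fopen(infile, "r");
    FILE *out = fopen(outfile, "w");

    if (in == NULL || out == NULL)
        return -1;

    while (fgets(line, sizeof(line), in) != NULL)
    {
        size_t len = strcspn(line, "\t"); /* first field is the key */

        if (len >= sizeof(key))
            len = sizeof(key) - 1;
        memcpy(key, line, len);
        key[len] = '\0';

        if (strcmp(key, prev_key) != 0) /* keep first occurrence only */
        {
            fputs(line, out);
            strcpy(prev_key, key);
        }
    }

    fclose(in);
    fclose(out);
    return 0;
}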

My main point is that Postgres already checks for duplicates, easily
and quickly, during the COPY (that check is what raises the error
today), and it would add essentially zero execution time to simply
ignore these duplicate rows instead of aborting. Obviously this is a
useful feature, otherwise Oracle, Ingres and other commercial
relational databases wouldn't offer similar functionality.
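
To make that concrete, the whole function above would collapse to a
single statement along these lines (the WITH IGNORE DUPLICATES clause
is purely hypothetical syntax for the behaviour I'm describing, not
anything Postgres accepts today):

    /* hypothetical syntax - a sketch of the proposed behaviour only */
    sprintf(cmd, "COPY BINARY %s FROM '%s' WITH IGNORE DUPLICATES",
            tab_name, impfile);
    EXEC SQL EXECUTE IMMEDIATE :cmd;

The temporary table, the DISTINCT ON pass and the extra commit would
all simply disappear.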

Yes, in an ideal world the input to COPY would be clean and consistent
with the defined indexes. However, that is only really the case when
COPY is used for database/table backup and restore; it misses the
point that a major use of COPY is as a speed optimisation for bulk
inserts...

Lee.
