From: Lee Kindness <lkindness(at)csl(dot)co(dot)uk>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Lee Kindness <lkindness(at)csl(dot)co(dot)uk>, "Jim Buttafuoco" <jim(at)buttafuoco(dot)net>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bulkloading using COPY - ignore duplicates?
Date: 2001-10-03 10:01:30
Message-ID: 15290.57850.68004.391312@elsick.csl.co.uk
Lists: pgsql-hackers
Peter Eisentraut writes:
> However, it seems to me that COPY ignoring duplicates can easily be
> done by preprocessing the input file.
Or by post-processing, like (error checking cut):
void import_shots(char *impfile, int lineshoot_id)
{
    char tab_name[128];
    char tab_temp[128];
    char cmd[256];      /* SQL command buffer; size assumed */

    frig_file(impfile); /* add the postgres header */
    sprintf(tab_name, "shot_%d", lineshoot_id);
    sprintf(tab_temp, "shot_%d_tmp", lineshoot_id);

    sprintf(cmd, "CREATE TEMPORARY TABLE %s AS SELECT * FROM shot",
            tab_temp);
    EXEC SQL EXECUTE IMMEDIATE :cmd;
    EXEC SQL COMMIT WORK; /* will not work without commit here! */

    sprintf(cmd, "COPY BINARY %s FROM '%s'", tab_temp, impfile);
    append_page_alloc(cmd, tab_name, impfile, 1);
    EXEC SQL EXECUTE IMMEDIATE :cmd;

    sprintf(cmd, "INSERT INTO %s SELECT DISTINCT ON (shot_time) * FROM %s",
            tab_name, tab_temp);
    EXEC SQL EXECUTE IMMEDIATE :cmd;

    sprintf(cmd, "DROP TABLE %s", tab_temp);
    EXEC SQL EXECUTE IMMEDIATE :cmd;
    EXEC SQL COMMIT WORK;

    remove(impfile);
}
However, this adds significant time to the import operation. Likewise,
I could loop over the input file first and hunt for duplicates, again
with a performance hit.
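That pre-scan could be a sketch along these lines (illustrative only:
it assumes a text file with the key, e.g. shot_time, in the first
tab-separated column, whereas the real import above uses COPY BINARY
format, which this does not handle):

```python
def dedupe_lines(lines, key_col=0, sep="\t"):
    """Yield only the first line seen for each key column value."""
    seen = set()
    for line in lines:
        key = line.rstrip("\n").split(sep)[key_col]
        if key in seen:
            continue  # duplicate key: drop this line
        seen.add(key)
        yield line

# Example: the second line keyed "100" is dropped.
rows = ["100\ta\n", "101\tb\n", "100\tc\n"]
print(list(dedupe_lines(rows)))
```

Even a single extra pass like this over a large shot file costs real
time, which is the performance hit referred to above.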
My main point is that Postgres can easily and quickly check for
duplicates during the COPY (as it does currently), and it would add
essentially zero execution time to simply ignore the duplicate rows
rather than abort. This is clearly a useful feature, otherwise Oracle,
Ingres and other commercial relational databases wouldn't offer
similar functionality.
Yes, in an ideal world the input to COPY would be clean and consistent
with the defined indexes. However, that is only really the case when
COPY is used for database/table backup and restore; it misses the
point that a major use of COPY is speed optimisation of bulk
inserts...
Lee.