Re: Bulkloading using COPY - ignore duplicates?

From: "Ross J(dot) Reedstrom" <reedstrm(at)rice(dot)edu>
To: Lee Kindness <lkindness(at)csl(dot)co(dot)uk>
Cc: Patrick Welche <prlw1(at)newn(dot)cam(dot)ac(dot)uk>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bulkloading using COPY - ignore duplicates?
Date: 2001-12-13 16:22:59
Message-ID: 20011213102259.B10349@rice.edu
Lists: pgsql-hackers

On Thu, Dec 13, 2001 at 03:44:31PM +0000, Lee Kindness wrote:
> Patrick Welche writes:
> > On Thu, Dec 13, 2001 at 01:25:11PM +0000, Lee Kindness wrote:
> > > That's what I'm currently doing as a workaround - a SELECT DISTINCT
> > > from a temporary table into the real table with the unique index on
> > > it. However, this takes absolute ages - say 5 seconds for the copy
> > > (which is the ballpark figure I'm aiming toward and can achieve with
> > > Ingres) plus another 30-ish seconds for the SELECT DISTINCT.
> > Then your column really isn't unique,
>
> That's another discussion entirely ;) - it's spat out by a real-time
> system which doesn't have the time or resources to check this. Further
> precision loss later in the data's life adds more duplicates...
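
(As a rough sketch, the temp-table workaround described above might look
something like the following, borrowing the table layout from the CREATE
TABLE further down in this message; the details are illustrative, not from
the original post:

    CREATE TEMP TABLE tab_load (p1 INT, p2 INT, other1 INT, other2 INT);
    COPY tab_load FROM 'file';            -- fast: no unique index to maintain
    INSERT INTO tab                       -- the slow, ~30 second step
         SELECT DISTINCT * FROM tab_load;
    DROP TABLE tab_load;

Note that plain DISTINCT only collapses rows that are identical in every
column; rows sharing (p1, p2) but differing elsewhere would still trip the
unique index on the real table.)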

Hmm, the data has a later life - sounds like you'll need to remove dups
then anyway, so can you get away with just letting the dups in? Remove
the UNIQUE requirement, and let the real-time system just dump away.
How critical is it to later steps that there be no dups? And how many
(potential) dups is your RTS producing, anyway?

Your later processing (which apparently can _generate_ dups) might be
the place, outside the critical time path, to worry about removing dups.
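
(If it helps to quantify how many dups the RTS is actually producing, a
query along these lines, once the data is loaded without the unique
constraint and again using the column names from Lee's example below,
would show which keys collide and how often:

    SELECT p1, p2, count(*) AS copies
      FROM tab
     GROUP BY p1, p2
    HAVING count(*) > 1;

If that returns only a handful of rows, deferring the cleanup to the
later, non-time-critical processing looks all the more reasonable.)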

Ross

P.S. This falls into the class of problem solving characterized by
"if you can't solve the problem as stated, restate the problem to be
one you _can_ solve" ;-)

>
> > so how about dropping the unique index, importing the data, fixing the
> > duplicates, and recreating the unique index - just as another possible
> > workaround ;)
>
> This is just going to take the same(ish) time, no?
>
> CREATE TABLE tab (p1 INT, p2 INT, other1 INT, other2 INT);
> COPY tab FROM 'file';
> DELETE FROM tab WHERE (p1, p2) NOT IN (SELECT DISTINCT p1, p2
>                                          FROM tab);
> CREATE UNIQUE INDEX tab_idx ON tab USING BTREE(p1, p2);
>
> or am I missing something?
>
> Thanks, Lee.
>
> --
> Lee Kindness, Senior Software Engineer, Concept Systems Limited.
> http://services.csl.co.uk/ http://www.csl.co.uk/ +44 131 5575595
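
(One note on the SQL sketch quoted above: since every row's (p1, p2) pair
is, by definition, returned by SELECT DISTINCT p1, p2 FROM tab, the NOT IN
matches nothing and the DELETE removes no rows. A variant that keeps
exactly one row per key needs some way to tell the copies apart; with the
row OIDs that tables of this era carry by default, one untested
possibility, assuming min() can be applied to the oid column, would be:

    DELETE FROM tab
     WHERE oid NOT IN (SELECT min(oid)
                         FROM tab
                        GROUP BY p1, p2);
    CREATE UNIQUE INDEX tab_idx ON tab (p1, p2);

Even so, that subquery is likely to cost on the same order as the SELECT
DISTINCT it replaces, so the timing concern still stands.)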
