Re: Bulkloading using COPY - ignore duplicates?

From: Lee Kindness <lkindness(at)csl(dot)co(dot)uk>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Lee Kindness <lkindness(at)csl(dot)co(dot)uk>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Jim Buttafuoco <jim(at)buttafuoco(dot)net>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bulkloading using COPY - ignore duplicates?
Date: 2001-12-18 15:51:50
Message-ID: 15391.26134.348061.302157@elsick.csl.co.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Tom Lane writes:
> Lee Kindness <lkindness(at)csl(dot)co(dot)uk> writes:
> > You're right - I was meaning 'SELECT DISTINCT ON ()'. However I'm only
> > using it as an example of where the database is choosing (be it
> > randomly) the data to discarded.
> Not a good example to support your argument. The entire point of
> DISTINCT ON (imho) is that the rows that are kept or discarded are
> *not* random, but can be selected by the user by specifying additional
> sort columns. DISTINCT ON would be pretty useless if it weren't for
> that flexibility. The corresponding concept in COPY will need to
> provide flexible means for deciding which row to keep and which to
> drop, else it'll be pretty useless.

At which point it becomes quicker to resort to INSERT...

Here's the crux question - how can I get management to go with
PostgreSQL when a core operation (import of data into a transient
database) is at least 6 times slower than the current version?

With a lot of work investigating the incoming data, the number of
incoming duplicates has been massively reduced by fixing/tackling at
source. However rouge values do still crop up (the data originates
from a real-time system with multiple hardware inputs from multiple
hardware vendors) and when they do (even just 1) the performance dies.
Add to this terrabytes of legacy data...

While you may see the option of ignoring duplicates in COPY as 'pretty
useless', it obviously has its place/use otherwise every other
database system wouldn't have support for it! (not that following the
pack is always a good idea)

In an ideal world 'COPY FROM' would only be used with data output by
'COPY TO' and it would be nice and sanitised. However in some fields
this often is not a possibility due to performance constraints!

Best regards,

--
Lee Kindness, Senior Software Engineer, Concept Systems Limited.
http://services.csl.co.uk/ http://www.csl.co.uk/

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2001-12-18 15:53:59 Re: Concerns about this release
Previous Message Tom Lane 2001-12-18 15:27:08 Re: FreeBSD/alpha