The deduplication process requires so many programmed procedures that it
runs on the client. Most of the de-dupe lookups are not "straight" lookups,
but calculated ones employing fuzzy logic. This is because we cannot dictate
the format of our input data and must deduplicate with what we get.
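
To illustrate what a "calculated" lookup means here, below is a minimal sketch in Python (not the Tcl the import program is actually written in). The function names, the normalisation rules, and the 0.85 threshold are all assumptions made for illustration; the real name-comparison functions are not shown in this thread and may work quite differently.

```python
import re
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, turn punctuation into spaces, and collapse whitespace.

    Hypothetical normalisation rules; real input formats vary.
    """
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(cleaned.split())


def similarity(a, b):
    """Return a 0.0-1.0 similarity ratio between two normalised names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_probable_duplicate(a, b, threshold=0.85):
    """Treat two names as the same entity if they are similar enough.

    The threshold is an assumed value, not one taken from the import program.
    """
    return similarity(a, b) >= threshold
```

A comparison like this cannot be answered by a plain equality lookup on an indexed column, which is why each candidate row costs real computation.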
The server-side programming options were one of the reasons I went with
PostgreSQL in the first place. However, I saw severe performance hits when
running processes on the server, and I partially abandoned the idea (some
custom-built name-comparison functions still run on the server).
I am using Tcl on both the server and the client. I'm not a fan of Tcl, but
it appears to be quite well implemented and feature-rich in PostgreSQL. I
find PL/pgSQL awkward, even compared to Tcl. (After all, I'm just a
programmer... we do tend to be a little limited.)
The import program actually runs on the server box as a db client and
involves about 3000 lines of code (and it will certainly grow steadily as we
add compatibility with more import formats). Could a process involving that
much logic run on the db server, and would there really be a benefit?
"Jim C. Nasby" <jim(at)nasby(dot)net> wrote in message
> On Thu, Sep 28, 2006 at 01:53:22PM -0400, Carlo Stonebanks wrote:
>> > are you using the 'copy' interface?
>> Straightforward inserts - the import data has to be transformed, normalised
>> and de-duped by the import program. I imagine the copy interface is for more
>> straightforward data importing. These are - by necessity - single row
> BTW, stuff like de-duping is something you really want the database -
> not an external program - to be doing. Think about loading the data into
> a temporary table and then working on it from there.
> Jim Nasby jim(at)nasby(dot)net
> EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
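
For reference, the staging-table approach suggested above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs on its own, the table and column names (person, staging, name_key) are hypothetical, and the de-dupe rule is a plain case-insensitive match rather than the fuzzy logic described earlier.

```python
import sqlite3


def load_and_dedupe(rows):
    """Bulk-load raw names into a temp table, then de-dupe with one SQL statement.

    sqlite3 stands in for PostgreSQL here; all table and column names
    are hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE person ("
        "id INTEGER PRIMARY KEY, name_key TEXT UNIQUE, name TEXT)"
    )
    # Staging table: raw input goes here untouched.
    conn.execute("CREATE TEMP TABLE staging (name TEXT)")
    conn.executemany(
        "INSERT INTO staging (name) VALUES (?)", [(r,) for r in rows]
    )
    # Set-based de-dupe: one statement instead of one round trip per row.
    conn.execute(
        """
        INSERT INTO person (name_key, name)
        SELECT lower(trim(name)), min(name)
        FROM staging
        GROUP BY lower(trim(name))
        """
    )
    conn.commit()
    keys = [r[0] for r in conn.execute(
        "SELECT name_key FROM person ORDER BY name_key")]
    conn.close()
    return keys
```

In PostgreSQL the load step would more likely use COPY into the temporary table, with the transformation done by SQL or server-side functions; whether the fuzzy comparisons would perform acceptably that way is exactly the open question in this thread.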