On Thu, 2006-09-28 at 12:44 -0400, Carlo Stonebanks wrote:
> At this early stage in the project, we are initializing our portal's
> database with millions of rows of imported data in over 50 different
> flattened tables; each table's structure is unique to the data provider.
> This requires a pretty complex import program, because the data must be
> matched semantically, not literally. Even with all of the expression
> matching and fuzzy logic in the code,our performance statistics show that
> the program spends over 75% of its time in SQL queries looking for matching
> and/or duplicate data.
My experience with that type of load process is that doing this
row-by-row is a very expensive approach and your results bear that out.
It is often better to write each step as an SQL statement that operates
on a set of rows at one time. The lookup operations become merge joins
rather than individual SQL Selects via an index, so increase the
efficiency of the lookup process by using bulk optimisations and
completely avoiding any program/server call traffic. Data can move from
step to step by using Insert Selects into temporary tables, as Jim has
The SQL set approach is different to the idea of simply moving the code
server-side by dropping it in a function. That helps with the net
traffic but has other issues also. You don't need to use esoteric
in-memory thingies if you use the more efficient join types already
available when you do set based operations (i.e. join all rows at once
in one big SQL statement).
You can also improve performance by ordering your checks so that the
ones most likely to fail happen first.
Trying to achieve a high level of data quality in one large project is
not often possible. Focus on the most critical areas of checking and get
that working first with acceptable performance, then layer on additional
checks while tuning. The complexity of the load programs you have also
means they are susceptible to introducing data quality problems rather
than removing them, so an incremental approach will also aid debugging
of the load suite. Dynamic SQL programs are particularly susceptible to
this kind of bug because you can't eyeball the code.
In response to
pgsql-performance by date
|Next:||From: Simon Riggs||Date: 2006-09-29 10:58:33|
|Subject: Re: archive wal's failure and load increase.|
|Previous:||From: Edoardo Ceccarelli||Date: 2006-09-29 09:19:19|
|Subject: Re: any hope for my big query?|