Skip site navigation (1) Skip section navigation (2)

ETL optimization

From: Bricklen Anderson <BAnderson(at)PresiNET(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: ETL optimization
Date: 2005-06-23 19:38:04
Message-ID: 42BB0F9C.2050207@PresiNET.com (view raw or flat)
Thread:
Lists: pgsql-performance
Situation:
I'm trying to optimize an ETL process with many upserts (~100k aggregated rows)
(no duplicates allowed). The source (table t2) table holds around 14 million
rows, and I'm grabbing them 100,000 rows at a time from t2, resulting in about
100,000 distinct rows in the destination table (t1).


What I've tried:

i. FOR EXECUTE LOOP over my result set (aggregated results, 100k-ish rows), and
try an update first, check the ROW_COUNT, if 0, then do an insert.
...
run time: approx. 25 mins


ii. in a function (pseudo code), (table name is dynamic):
...
up_stm :=
'UPDATE '||t1||' SET x=t2.x
FROM	(select sum(x),a,b,c
	from t2
	group by a,b,c) as t2
WHERE '||t1||'.a=t2.a AND '||t1||'.b=t2.b AND '||t1||'.c=t3.c';

EXECUTE up_stm;

ins_stm :=
'INSERT INTO '||t1||' (x,a,b,c) select x,a,b,c
FROM (select sum(x) as x,a,b,c from t2 group by a,b,c) as t2
WHERE NOT EXISTS
	(select true from '||t1||'
	where '||t1||'.a=t2.a
	and '||t1||'.b=t2.b
	and '||t1||'.c=t2.c
	limit 1)';

EXECUTE ins_stm;
...

takes about 7 minutes. The performance of this is reasonable, but there is room
for improvement.
I think it's the NOT EXISTS subquery on the insert that makes the first run
slow. Any revisions that may be faster (for the subquery)?
Note, this subquery is necessary so that duplicates don't get into the target
table (t1).

Subsequent runs will be mostly updates (and still slow), with few inserts. I'm
not seeing a way for that update statement to be sped up, but maybe someone else
does?


iii. UNIQUE constraint on table "t1". This didn't seem to perform too badly with
fewer rows (preliminary tests), but as you'd expect, on error the whole
transaction would roll back. Is it possible to skip a row if it causes an error,
as opposed to aborting the transaction altogether?



To summarize, I'm looking for the most efficient and fastest way to perform my
upserts. Tips and/or references to pertinent docs are also appreciated!
If any more information is necessary, please let me know.


(postgresql 8.0.3, linux)


Cheers,

Bricklen
-- 
_______________________________

This e-mail may be privileged and/or confidential, and the sender does
not waive any related rights and obligations. Any distribution, use or
copying of this e-mail or the information it contains by other than an
intended recipient is unauthorized. If you received this e-mail in
error, please advise me (by return e-mail or otherwise) immediately.
_______________________________

Responses

pgsql-performance by date

Next:From: Meetesh KariaDate: 2005-06-23 19:54:06
Subject: Re: ETL optimization
Previous:From: Josh BerkusDate: 2005-06-23 18:55:35
Subject: Re: parameterized LIKE does not use index

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group