Skip site navigation (1) Skip section navigation (2)

Re: ETL optimization

From: Meetesh Karia <meetesh(dot)karia(at)gmail(dot)com>
To: Bricklen Anderson <BAnderson(at)presinet(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: ETL optimization
Date: 2005-06-23 19:54:06
Message-ID: fc5b04ca05062312546323c07e@mail.gmail.com (view raw or flat)
Thread:
Lists: pgsql-performance
I don't know what this will change wrt how often you need to run VACUUM (I'm 
a SQL Server guy), but instead of an update and insert, try a delete and 
insert. You'll only have to find the duplicate rows once and your insert 
doesn't need a where clause.

Meetesh

On 6/23/05, Bricklen Anderson <BAnderson(at)presinet(dot)com> wrote:
> 
> Situation:
> I'm trying to optimize an ETL process with many upserts (~100k aggregated 
> rows)
> (no duplicates allowed). The source (table t2) table holds around 14 
> million
> rows, and I'm grabbing them 100,000 rows at a time from t2, resulting in 
> about
> 100,000 distinct rows in the destination table (t1).
> 
> 
> What I've tried:
> 
> i. FOR EXECUTE LOOP over my result set (aggregated results, 100k-ish 
> rows), and
> try an update first, check the ROW_COUNT, if 0, then do an insert.
> ...
> run time: approx. 25 mins
> 
> 
> ii. in a function (pseudo code), (table name is dynamic):
> ...
> up_stm :=
> 'UPDATE '||t1||' SET x=t2.x
> FROM (select sum(x),a,b,c
> from t2
> group by a,b,c) as t2
> WHERE '||t1||'.a=t2.a AND '||t1||'.b=t2.b AND '||t1||'.c=t3.c';
> 
> EXECUTE up_stm;
> 
> ins_stm :=
> 'INSERT INTO '||t1||' (x,a,b,c) select x,a,b,c
> FROM (select sum(x) as x,a,b,c from t2 group by a,b,c) as t2
> WHERE NOT EXISTS
> (select true from '||t1||'
> where '||t1||'.a=t2.a
> and '||t1||'.b=t2.b
> and '||t1||'.c=t2.c
> limit 1)';
> 
> EXECUTE ins_stm;
> ...
> 
> takes about 7 minutes. The performance of this is reasonable, but there is 
> room
> for improvement.
> I think it's the NOT EXISTS subquery on the insert that makes the first 
> run
> slow. Any revisions that may be faster (for the subquery)?
> Note, this subquery is necessary so that duplicates don't get into the 
> target
> table (t1).
> 
> Subsequent runs will be mostly updates (and still slow), with few inserts. 
> I'm
> not seeing a way for that update statement to be sped up, but maybe 
> someone else
> does?
> 
> 
> iii. UNIQUE constraint on table "t1". This didn't seem to perform too 
> badly with
> fewer rows (preliminary tests), but as you'd expect, on error the whole
> transaction would roll back. Is it possible to skip a row if it causes an 
> error,
> as opposed to aborting the transaction altogether?
> 
> 
> 
> To summarize, I'm looking for the most efficient and fastest way to 
> perform my
> upserts. Tips and/or references to pertinent docs are also appreciated!
> If any more information is necessary, please let me know.
> 
> 
> (postgresql 8.0.3, linux)
> 
> 
> Cheers,
> 
> Bricklen
> --
> _______________________________
> 
> This e-mail may be privileged and/or confidential, and the sender does
> not waive any related rights and obligations. Any distribution, use or
> copying of this e-mail or the information it contains by other than an
> intended recipient is unauthorized. If you received this e-mail in
> error, please advise me (by return e-mail or otherwise) immediately.
> _______________________________
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
> (send "unregister YourEmailAddressHere" to majordomo(at)postgresql(dot)org)
>

In response to

Responses

pgsql-performance by date

Next:From: Bricklen AndersonDate: 2005-06-23 20:16:50
Subject: Re: ETL optimization
Previous:From: Bricklen AndersonDate: 2005-06-23 19:38:04
Subject: ETL optimization

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group