Re: ETL optimization

From: Meetesh Karia <meetesh(dot)karia(at)gmail(dot)com>
To: Bricklen Anderson <BAnderson(at)presinet(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: ETL optimization
Date: 2005-06-23 19:54:06
Message-ID: fc5b04ca05062312546323c07e@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

I don't know what this will change wrt how often you need to run VACUUM (I'm
a SQL Server guy), but instead of an update and insert, try a delete and
insert. You'll only have to find the duplicate rows once and your insert
doesn't need a where clause.

Meetesh

On 6/23/05, Bricklen Anderson <BAnderson(at)presinet(dot)com> wrote:
>
> Situation:
> I'm trying to optimize an ETL process with many upserts (~100k aggregated
> rows)
> (no duplicates allowed). The source (table t2) table holds around 14
> million
> rows, and I'm grabbing them 100,000 rows at a time from t2, resulting in
> about
> 100,000 distinct rows in the destination table (t1).
>
>
> What I've tried:
>
> i. FOR EXECUTE LOOP over my result set (aggregated results, 100k-ish
> rows), and
> try an update first, check the ROW_COUNT, if 0, then do an insert.
> ...
> run time: approx. 25 mins
>
>
> ii. in a function (pseudo code), (table name is dynamic):
> ...
> up_stm :=
> 'UPDATE '||t1||' SET x=t2.x
> FROM (select sum(x),a,b,c
> from t2
> group by a,b,c) as t2
> WHERE '||t1||'.a=t2.a AND '||t1||'.b=t2.b AND '||t1||'.c=t3.c';
>
> EXECUTE up_stm;
>
> ins_stm :=
> 'INSERT INTO '||t1||' (x,a,b,c) select x,a,b,c
> FROM (select sum(x) as x,a,b,c from t2 group by a,b,c) as t2
> WHERE NOT EXISTS
> (select true from '||t1||'
> where '||t1||'.a=t2.a
> and '||t1||'.b=t2.b
> and '||t1||'.c=t2.c
> limit 1)';
>
> EXECUTE ins_stm;
> ...
>
> takes about 7 minutes. The performance of this is reasonable, but there is
> room
> for improvement.
> I think it's the NOT EXISTS subquery on the insert that makes the first
> run
> slow. Any revisions that may be faster (for the subquery)?
> Note, this subquery is necessary so that duplicates don't get into the
> target
> table (t1).
>
> Subsequent runs will be mostly updates (and still slow), with few inserts.
> I'm
> not seeing a way for that update statement to be sped up, but maybe
> someone else
> does?
>
>
> iii. UNIQUE constraint on table "t1". This didn't seem to perform too
> badly with
> fewer rows (preliminary tests), but as you'd expect, on error the whole
> transaction would roll back. Is it possible to skip a row if it causes an
> error,
> as opposed to aborting the transaction altogether?
>
>
>
> To summarize, I'm looking for the most efficient and fastest way to
> perform my
> upserts. Tips and/or references to pertinent docs are also appreciated!
> If any more information is necessary, please let me know.
>
>
> (postgresql 8.0.3, linux)
>
>
> Cheers,
>
> Bricklen
> --
> _______________________________
>
> This e-mail may be privileged and/or confidential, and the sender does
> not waive any related rights and obligations. Any distribution, use or
> copying of this e-mail or the information it contains by other than an
> intended recipient is unauthorized. If you received this e-mail in
> error, please advise me (by return e-mail or otherwise) immediately.
> _______________________________
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
> (send "unregister YourEmailAddressHere" to majordomo(at)postgresql(dot)org)
>

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Bricklen Anderson 2005-06-23 20:16:50 Re: ETL optimization
Previous Message Bricklen Anderson 2005-06-23 19:38:04 ETL optimization