Skip site navigation (1) Skip section navigation (2)

Re: ETL optimization

From: Jacques Caron <jc(at)directinfos(dot)com>
To: Bricklen Anderson <BAnderson(at)PresiNET(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: ETL optimization
Date: 2005-06-23 21:56:27
Message-ID: 6.2.0.14.0.20050623234711.053c4d48@pop.interactivemediafactory.net (view raw or flat)
Thread:
Lists: pgsql-performance
Hi,

At 21:38 23/06/2005, Bricklen Anderson wrote:
>Situation:
>I'm trying to optimize an ETL process with many upserts (~100k aggregated 
>rows)
>(no duplicates allowed). The source (table t2) table holds around 14 million
>rows, and I'm grabbing them 100,000 rows at a time from t2, resulting in about
>100,000 distinct rows in the destination table (t1).
>
>
>What I've tried:
>
>i. FOR EXECUTE LOOP over my result set (aggregated results, 100k-ish 
>rows), and
>try an update first, check the ROW_COUNT, if 0, then do an insert.
>...
>run time: approx. 25 mins
>
>
>ii. in a function (pseudo code), (table name is dynamic):
>...
>up_stm :=
>'UPDATE '||t1||' SET x=t2.x
>FROM    (select sum(x),a,b,c
>         from t2
>         group by a,b,c) as t2
>WHERE '||t1||'.a=t2.a AND '||t1||'.b=t2.b AND '||t1||'.c=t3.c';
>
>EXECUTE up_stm;
>
>ins_stm :=
>'INSERT INTO '||t1||' (x,a,b,c) select x,a,b,c
>FROM (select sum(x) as x,a,b,c from t2 group by a,b,c) as t2
>WHERE NOT EXISTS
>         (select true from '||t1||'
>         where '||t1||'.a=t2.a
>         and '||t1||'.b=t2.b
>         and '||t1||'.c=t2.c
>         limit 1)';
>
>EXECUTE ins_stm;
>...

I have a similar situation, and the solution I use (though I haven't really 
tested many different situations):
- have a trigger ON INSERT which does:
UPDATE set whatever_value=NEW.whatever_value,... WHERE 
whatever_key=NEW.whatever.key AND...
IF FOUND THEN
  RETURN NULL;
ELSE
  RETURN NEW;
END IF;
- use COPY

For optimal performance, a different trigger function is created for each 
table, which allows the query plan of the UPDATE to be cached.

Let us know how that works out for you and if you find a better solution!

Jacques.



In response to

Responses

pgsql-performance by date

Next:From: Bricklen AndersonDate: 2005-06-23 22:04:33
Subject: Re: ETL optimization
Previous:From: Bricklen AndersonDate: 2005-06-23 20:16:50
Subject: Re: ETL optimization

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group