I don't know what this will change wrt how often you need to run VACUUM (I'm
a SQL Server guy), but instead of an update and insert, try a delete and
insert. You'll only have to find the duplicate rows once and your insert
doesn't need a where clause.
On 6/23/05, Bricklen Anderson <BAnderson(at)presinet(dot)com> wrote:
> I'm trying to optimize an ETL process with many upserts (~100k aggregated
> (no duplicates allowed). The source (table t2) table holds around 14
> rows, and I'm grabbing them 100,000 rows at a time from t2, resulting in
> 100,000 distinct rows in the destination table (t1).
> What I've tried:
> i. FOR EXECUTE LOOP over my result set (aggregated results, 100k-ish
> rows), and
> try an update first, check the ROW_COUNT, if 0, then do an insert.
> run time: approx. 25 mins
> ii. in a function (pseudo code), (table name is dynamic):
> up_stm :=
> 'UPDATE '||t1||' SET x=t2.x
> FROM (select sum(x),a,b,c
> from t2
> group by a,b,c) as t2
> WHERE '||t1||'.a=t2.a AND '||t1||'.b=t2.b AND '||t1||'.c=t3.c';
> EXECUTE up_stm;
> ins_stm :=
> 'INSERT INTO '||t1||' (x,a,b,c) select x,a,b,c
> FROM (select sum(x) as x,a,b,c from t2 group by a,b,c) as t2
> WHERE NOT EXISTS
> (select true from '||t1||'
> where '||t1||'.a=t2.a
> and '||t1||'.b=t2.b
> and '||t1||'.c=t2.c
> limit 1)';
> EXECUTE ins_stm;
> takes about 7 minutes. The performance of this is reasonable, but there is
> for improvement.
> I think it's the NOT EXISTS subquery on the insert that makes the first
> slow. Any revisions that may be faster (for the subquery)?
> Note, this subquery is necessary so that duplicates don't get into the
> table (t1).
> Subsequent runs will be mostly updates (and still slow), with few inserts.
> not seeing a way for that update statement to be sped up, but maybe
> someone else
> iii. UNIQUE constraint on table "t1". This didn't seem to perform too
> badly with
> fewer rows (preliminary tests), but as you'd expect, on error the whole
> transaction would roll back. Is it possible to skip a row if it causes an
> as opposed to aborting the transaction altogether?
> To summarize, I'm looking for the most efficient and fastest way to
> perform my
> upserts. Tips and/or references to pertinent docs are also appreciated!
> If any more information is necessary, please let me know.
> (postgresql 8.0.3, linux)
> This e-mail may be privileged and/or confidential, and the sender does
> not waive any related rights and obligations. Any distribution, use or
> copying of this e-mail or the information it contains by other than an
> intended recipient is unauthorized. If you received this e-mail in
> error, please advise me (by return e-mail or otherwise) immediately.
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
> (send "unregister YourEmailAddressHere" to majordomo(at)postgresql(dot)org)
In response to
pgsql-performance by date
|Next:||From: Bricklen Anderson||Date: 2005-06-23 20:16:50|
|Subject: Re: ETL optimization|
|Previous:||From: Bricklen Anderson||Date: 2005-06-23 19:38:04|
|Subject: ETL optimization|