Skip site navigation (1) Skip section navigation (2)

Re: Duplicate deletion optimizations

From: "Marc Mamin" <M(dot)Mamin(at)intershop(dot)de>
To: "Jochen Erwied" <jochen(at)pgsql-performance(dot)erwied(dot)eu>, <antoine(at)inaps(dot)org>
Cc: <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Duplicate deletion optimizations
Date: 2012-01-07 12:21:02
Message-ID: C4DAC901169B624F933534A26ED7DF3103E917F3@JENMAIL01.ad.intershop.net (view raw or flat)
Thread:
Lists: pgsql-performance
Yes, but it should become a bit slower if you fix your code :-)

  where t_imp.id is null and test.id=t_imp.id;
  =>
  where t_imp.id is not null and test.id=t_imp.id;

and a partial index on matching rows might help (should be tested):

 (after the first updat)
 create index t_imp_ix on t_imp(t_value,t_record,output_id) where t_imp.id is not null.

regards,
Marc Mamin

-----Urspr√ľngliche Nachricht-----
Von: pgsql-performance-owner(at)postgresql(dot)org im Auftrag von Jochen Erwied
Gesendet: Sa 1/7/2012 12:57
An: antoine(at)inaps(dot)org
Cc: pgsql-performance(at)postgresql(dot)org
Betreff: Re: [PERFORM] Duplicate deletion optimizations
 
Friday, January 6, 2012, 4:21:06 PM you wrote:

>> Every 5 minutes, a process have to insert a few thousand of rows in this
>> table, but sometime, the process have to insert an already existing row
>> (based on values in the triplet (t_value, t_record, output_id). In this
>> case, the row must be updated with the new count value. I've tried some
>> solution given on this stackoverflow question [1] but the insertion rate
>> is always too low for my needs.

I did check the following in a loop, starting with an empty table, and
inserting/updating 50000 random unique entries. After 15 minutes I've got
about 10 million records, each loop takes about 3 seconds. After 30 minutes
the table contains approx. 18 million entries, time per loop only slightly
increased. After 90 minutes the database has about 30 million entries. The
speed has dropped to about 15-20 seconds per loop, but the server is doing
lots of other queries in parallel, so with an unloaded server the updates
should still take less than 10 seconds.

The generator runs in perl, and generates records for a maximum of 100 
million different entries:

use strict;

srand time;
my $i = 0;
open FD, ">data.in";
for (1..50000)
{
        $i += rand(2000);
        print FD sprintf("%d\t%d\t%d\t%d\n", $i/65536, ($i/256)%255, $i%255, rand(1000));
}
close FD;

The SQL-script looks like this:

\timing on
begin;
create temp table t_imp(id bigint,t_value integer,t_record integer,output_id integer,count bigint);
\copy t_imp (t_value, t_record, output_id, count) from 'data.in'
--an index is not really needed, table is in memory anyway
--create index t_imp_ix on t_imp(t_value,t_record,output_id);

-- find matching rows
update t_imp
       set id=test.id
       from test
       where (t_imp.t_value,t_imp.t_record,t_imp.output_id)=(test.t_value,test.t_record,test.output_id);
-- update matching rows using primary key
update test
       set count=t_imp.count
       from t_imp
       where t_imp.id is null and test.id=t_imp.id;
-- insert missing rows
insert into test(t_value,t_record,output_id,count)
       select t_value,t_record,output_id,count
              from t_imp
              where id is null;
commit;

Advantages of this solution:

- all updates are done in-place, no index modifications (except for the 
  inserts, of course)
- big table only gets inserts
- no dead tuples from deletes
- completely avoids sequential scans on the big table

Tested on my home server (8GB RAM, 3GB shared memory, Dual-Xeon 5110, 1.6 
GHz, table and indices stored on a SSD)

Table statistics:

relid             | 14332525
schemaname        | public
relname           | test
seq_scan          | 8
seq_tup_read      | 111541821
idx_scan          | 149240169
idx_tup_fetch     | 117901695
n_tup_ins         | 30280175
n_tup_upd         | 0
n_tup_del         | 0
n_tup_hot_upd     | 0
n_live_tup        | 30264431
n_dead_tup        | 0
last_vacuum       |
last_autovacuum   |
last_analyze      |
last_autoanalyze  | 2012-01-07 12:38:49.593651+01
vacuum_count      | 0
autovacuum_count  | 0
analyze_count     | 0
autoanalyze_count | 31

The sequential scans were from some 'select count(*)' in between.

HTH.

-- 
Jochen Erwied     |   home: jochen(at)erwied(dot)eu     +49-208-38800-18, FAX: -19
Sauerbruchstr. 17 |   work: joe(at)mbs-software(dot)de  +49-2151-7294-24, FAX: -50
D-45470 Muelheim  | mobile: jochen(dot)erwied(at)vodafone(dot)de       +49-173-5404164


In response to

Responses

pgsql-performance by date

Next:From: Misa SimicDate: 2012-01-07 14:02:10
Subject: Re: Duplicate deletion optimizations
Previous:From: Pierre CDate: 2012-01-07 12:20:03
Subject: Re: Duplicate deletion optimizations

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group