Skip site navigation (1) Skip section navigation (2)

Duplicate deletion optimizations

From: antoine(at)inaps(dot)org
To: <pgsql-performance(at)postgresql(dot)org>
Subject: Duplicate deletion optimizations
Date: 2012-01-06 14:35:36
Message-ID: (view raw, whole thread or download thread mbox)
Lists: pgsql-performance

I've a table with approximately 50 million rows with a schema like 

     id bigint NOT NULL DEFAULT nextval('stats_5mn'::regclass),
     t_value integer NOT NULL DEFAULT 0,
     t_record integer NOT NULL DEFAULT 0,
     output_id integer NOT NULL DEFAULT 0,
     count bigint NOT NULL DEFAULT 0,
     CONSTRAINT stats_mcs_5min_pkey PRIMARY KEY (id)

Every 5 minutes, a process have to insert a few thousand of rows in 
this table,
but sometime, the process have to insert an already existing row (based 
values in the triplet (t_value, t_record, output_id). In this case, the 
must be updated with the new count value. I've tried some solution 
given on this
stackoverflow question [1] but the insertion rate is always too low for 
my needs.

So, I've decided to do it in two times:

  - I insert all my new data with a COPY command
  - When it's done, I run a delete query to remove oldest duplicates

Right now, my delete query look like this:

     SELECT min(id) FROM stats_5mn
     GROUP BY t_value, t_record, output_id
     HAVING count(*) > 1;

The duration of the query on my test machine with approx. 16 million 
rows is ~18s.

To reduce this duration, I've tried to add an index on my triplet:

     CREATE INDEX test
       ON stats_5mn
       USING btree
       (t_value , t_record , output_id );

By default, the PostgreSQL planner doesn't want to use my index and do 
a sequential
scan [2], but if I force it with "SET enable_seqscan = off", the index 
is used [3]
and query duration is lowered to ~5s.

My questions:

  - Why the planner refuse to use my index?
  - Is there a better method for my problem?

Thanks by advance for your help,
Antoine Millet.


[2] :
     GroupAggregate  (cost=1167282.380..1294947.770 rows=762182 
width=20) (actual time=20067.661..20067.661 rows=0 loops=1)
         Filter: (five(*) > 1)
       ->  Sort  (cost=1167282.380..1186336.910 rows=7621814 width=20) 
(actual time=15663.549..17463.458 rows=7621805 loops=1)
               Sort Key: delta, kilo, four
               Sort Method:  external merge  Disk: 223512kB
             ->  Seq Scan on three  (cost=0.000..139734.140 rows=7621814 
width=20) (actual time=0.041..2093.434 rows=7621805 loops=1)

[3] :
     GroupAggregate  (cost=0.000..11531349.190 rows=762182 width=20) 
(actual time=5307.734..5307.734 rows=0 loops=1)
         Filter: (five(*) > 1)
       ->  Index Scan using charlie on three  (cost=0.000..11422738.330 
rows=7621814 width=20) (actual time=0.046..2062.952 rows=7621805 


pgsql-performance by date

Next:From: antoineDate: 2012-01-06 15:21:06
Subject: Re: Duplicate deletion optimizations
Previous:From: Aleksej TrofimovDate: 2012-01-04 11:42:26
Subject: Re: Postgresql Replication Performance

Privacy Policy | About PostgreSQL
Copyright © 1996-2017 The PostgreSQL Global Development Group