Skip site navigation (1) Skip section navigation (2)

Re: Duplicate deletion optimizations

From: antoine(at)inaps(dot)org
To: <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Duplicate deletion optimizations
Date: 2012-01-06 15:21:06
Message-ID: 3907c59005bc154cf0b8ebb918303d51@inaps.org (view raw or flat)
Thread:
Lists: pgsql-performance
On Fri, 06 Jan 2012 15:35:36 +0100, antoine(at)inaps(dot)org wrote:
> Hello,
>
> I've a table with approximately 50 million rows with a schema like 
> this:
>
>     id bigint NOT NULL DEFAULT nextval('stats_5mn'::regclass),
>     t_value integer NOT NULL DEFAULT 0,
>     t_record integer NOT NULL DEFAULT 0,
>     output_id integer NOT NULL DEFAULT 0,
>     count bigint NOT NULL DEFAULT 0,
>     CONSTRAINT stats_mcs_5min_pkey PRIMARY KEY (id)
>
> Every 5 minutes, a process have to insert a few thousand of rows in
> this table,
> but sometime, the process have to insert an already existing row 
> (based on
> values in the triplet (t_value, t_record, output_id). In this case, 
> the row
> must be updated with the new count value. I've tried some solution
> given on this
> stackoverflow question [1] but the insertion rate is always too low
> for my needs.
>
> So, I've decided to do it in two times:
>
>  - I insert all my new data with a COPY command
>  - When it's done, I run a delete query to remove oldest duplicates
>
> Right now, my delete query look like this:
>
>     SELECT min(id) FROM stats_5mn
>     GROUP BY t_value, t_record, output_id
>     HAVING count(*) > 1;

Correction:

     DELETE FROM stats_5mn WHERE id in (
         SELECT min(id) FROM stats_5mn
         GROUP BY t_value, t_record, output_id
         HAVING count(*) > 1;
     );

Sorry :-)

>
> The duration of the query on my test machine with approx. 16 million
> rows is ~18s.
>
> To reduce this duration, I've tried to add an index on my triplet:
>
>     CREATE INDEX test
>       ON stats_5mn
>       USING btree
>       (t_value , t_record , output_id );
>
> By default, the PostgreSQL planner doesn't want to use my index and
> do a sequential
> scan [2], but if I force it with "SET enable_seqscan = off", the
> index is used [3]
> and query duration is lowered to ~5s.
>
>
> My questions:
>
>  - Why the planner refuse to use my index?
>  - Is there a better method for my problem?
>
>
> Thanks by advance for your help,
> Antoine Millet.
>
>
> [1]
> 
> http://stackoverflow.com/questions/1109061/insert-on-duplicate-update-postgresql
>
> 
> http://stackoverflow.com/questions/3464750/postgres-upsert-insert-or-update-only-if-value-is-different
>
> [2] http://explain.depesz.com/s/UzW :
>     GroupAggregate  (cost=1167282.380..1294947.770 rows=762182
> width=20) (actual time=20067.661..20067.661 rows=0 loops=1)
>         Filter: (five(*) > 1)
>       ->  Sort  (cost=1167282.380..1186336.910 rows=7621814 width=20)
> (actual time=15663.549..17463.458 rows=7621805 loops=1)
>               Sort Key: delta, kilo, four
>               Sort Method:  external merge  Disk: 223512kB
>             ->  Seq Scan on three  (cost=0.000..139734.140
> rows=7621814 width=20) (actual time=0.041..2093.434 rows=7621805
> loops=1)
>
> [3] http://explain.depesz.com/s/o9P :
>     GroupAggregate  (cost=0.000..11531349.190 rows=762182 width=20)
> (actual time=5307.734..5307.734 rows=0 loops=1)
>         Filter: (five(*) > 1)
>       ->  Index Scan using charlie on three
> (cost=0.000..11422738.330 rows=7621814 width=20) (actual
> time=0.046..2062.952 rows=7621805 loops=1)


In response to

Responses

pgsql-performance by date

Next:From: Samuel GendlerDate: 2012-01-06 20:02:24
Subject: Re: Duplicate deletion optimizations
Previous:From: antoineDate: 2012-01-06 14:35:36
Subject: Duplicate deletion optimizations

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group