Re: Duplicate deletion optimizations

From: Samuel Gendler <sgendler(at)ideasculptor(dot)com>
To: antoine(at)inaps(dot)org
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Duplicate deletion optimizations
Date: 2012-01-06 20:02:24
Message-ID: CAEV0TzCqavqQTtsORx3thrKM3FjhahK4y5Zx4HF4W+-Bosdrmg@mail.gmail.com
Lists: pgsql-performance

On Fri, Jan 6, 2012 at 6:35 AM, <antoine(at)inaps(dot)org> wrote:

> Hello,
>
> I've a table with approximately 50 million rows with a schema like this:
>
> id bigint NOT NULL DEFAULT nextval('stats_5mn'::regclass),
> t_value integer NOT NULL DEFAULT 0,
> t_record integer NOT NULL DEFAULT 0,
> output_id integer NOT NULL DEFAULT 0,
> count bigint NOT NULL DEFAULT 0,
> CONSTRAINT stats_mcs_5min_pkey PRIMARY KEY (id)
>
> Every 5 minutes, a process has to insert a few thousand rows into this
> table, but sometimes the process has to insert an already existing row
> (based on the values in the triplet (t_value, t_record, output_id)).
> In this case, the row must be updated with the new count value. I've
> tried some of the solutions given in this stackoverflow question [1],
> but the insertion rate is always too low for my needs.
>
> So, I've decided to do it in two steps:
>
> - I insert all my new data with a COPY command
> - When it's done, I run a delete query to remove oldest duplicates
>
> Right now, my delete query looks like this:
>
> SELECT min(id) FROM stats_5mn
> GROUP BY t_value, t_record, output_id
> HAVING count(*) > 1;
>
> The duration of the query on my test machine with approx. 16 million rows
> is ~18s.
>
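Presumably that SELECT is the subquery of a DELETE along these lines (an
assumption on my part; only the subquery was quoted):

DELETE FROM stats_5mn
WHERE id IN (
  SELECT min(id)
  FROM stats_5mn
  GROUP BY t_value, t_record, output_id
  HAVING count(*) > 1
);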

Have you considered doing a bulk insert into a temp table and then pulling
the rows that don't exist across to the final table in one query, and
updating the rows that do exist in another? I did a very brief scan of the
SO thread and didn't see it suggested. Something like this:

update stats_5mn
set count = count + t.count
from temp_table t
where stats_5mn.t_value = t.t_value
  and stats_5mn.t_record = t.t_record
  and stats_5mn.output_id = t.output_id;

insert into stats_5mn
select * from temp_table t
where not exists (
  select 1 from stats_5mn s
  where s.t_value = t.t_value
    and s.t_record = t.t_record
    and s.output_id = t.output_id
);

drop table temp_table;

Note - you must do the update before the insert because doing it the other
way around will cause every row you just inserted to also be updated.
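
For completeness, the staging step that the above assumes might look
something like this (a sketch; the temp table name matches the queries
above, and the file path and format are made up):

CREATE TEMP TABLE temp_table (LIKE stats_5mn INCLUDING DEFAULTS);

COPY temp_table (t_value, t_record, output_id, count)
  FROM '/path/to/new_batch.csv' WITH (FORMAT csv);

Because of INCLUDING DEFAULTS, the id column is filled from the same
default/sequence during the COPY, so the "select *" insert above still
lines up column for column.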

I'm not sure it'd be markedly faster, but you'd at least be able to retain
a unique constraint on the triplet, if desired. And, to my eye, the logic
is easier to comprehend. The different query structure may make better use
of your index, but I imagine that it is not using it currently because your
db isn't configured to accurately reflect the real cost of index use vs
sequential scan, so it is incorrectly determining the cost of looking up
7.5 million rows. Its estimate of the row count is correct, so the
estimate of the cost must be the problem. We'd need to know more about
your current config and hardware specs to be able to even start making
suggestions about config changes to correct the problem.
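
If you do want the unique constraint once the duplicates are gone, it could
be added like this (a sketch; the constraint name is made up):

ALTER TABLE stats_5mn
  ADD CONSTRAINT stats_5mn_triplet_key UNIQUE (t_value, t_record, output_id);

The same index would then back the not-exists and update lookups above. As
for the planner costing, the usual knobs are random_page_cost and
effective_cache_size, but sensible values depend on the hardware, which is
why the config and specs matter.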
