Quick Links

Re: Duplicate deletion optimizations

From:	Misa Simic <misa(dot)simic(at)gmail(dot)com>
To:	"antoine(at)inaps(dot)org" <antoine(at)inaps(dot)org>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject:	Re: Duplicate deletion optimizations
Date:	2012-01-07 03:28:06
Message-ID:	-8455307354303926221@unknownmsgid
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-performance

hi,

Maybe these thoughts could help....

1) order by those three columns in your select min query could force
index usage...

2) or

DELETE FROM table
WHERE EXISTS(SELECT id FROM table t WHERE t.id > table.id AND t.col1 =
table.col1 AND t.col2 = table.col2 AND col3 = table.col3)

Sent from my Windows Phone
From: antoine(at)inaps(dot)org
Sent: 06/01/2012 15:36
To: pgsql-performance(at)postgresql(dot)org
Subject: [PERFORM] Duplicate deletion optimizations
Hello,

I've a table with approximately 50 million rows with a schema like
this:

id bigint NOT NULL DEFAULT nextval('stats_5mn'::regclass),
t_value integer NOT NULL DEFAULT 0,
t_record integer NOT NULL DEFAULT 0,
output_id integer NOT NULL DEFAULT 0,
count bigint NOT NULL DEFAULT 0,
CONSTRAINT stats_mcs_5min_pkey PRIMARY KEY (id)

Every 5 minutes, a process have to insert a few thousand of rows in
this table,
but sometime, the process have to insert an already existing row (based
on
values in the triplet (t_value, t_record, output_id). In this case, the
row
must be updated with the new count value. I've tried some solution
given on this
stackoverflow question [1] but the insertion rate is always too low for
my needs.

So, I've decided to do it in two times:

- I insert all my new data with a COPY command
- When it's done, I run a delete query to remove oldest duplicates

Right now, my delete query look like this:

SELECT min(id) FROM stats_5mn
GROUP BY t_value, t_record, output_id
HAVING count(*) > 1;

The duration of the query on my test machine with approx. 16 million
rows is ~18s.

To reduce this duration, I've tried to add an index on my triplet:

CREATE INDEX test
ON stats_5mn
USING btree
(t_value , t_record , output_id );

By default, the PostgreSQL planner doesn't want to use my index and do
a sequential
scan [2], but if I force it with "SET enable_seqscan = off", the index
is used [3]
and query duration is lowered to ~5s.

My questions:

- Why the planner refuse to use my index?
- Is there a better method for my problem?

Thanks by advance for your help,
Antoine Millet.

[1]
http://stackoverflow.com/questions/1109061/insert-on-duplicate-update-postgresql

http://stackoverflow.com/questions/3464750/postgres-upsert-insert-or-update-only-if-value-is-different

[2] http://explain.depesz.com/s/UzW :
GroupAggregate (cost=1167282.380..1294947.770 rows=762182
width=20) (actual time=20067.661..20067.661 rows=0 loops=1)
Filter: (five(*) > 1)
-> Sort (cost=1167282.380..1186336.910 rows=7621814 width=20)
(actual time=15663.549..17463.458 rows=7621805 loops=1)
Sort Key: delta, kilo, four
Sort Method: external merge Disk: 223512kB
-> Seq Scan on three (cost=0.000..139734.140 rows=7621814
width=20) (actual time=0.041..2093.434 rows=7621805 loops=1)

[3] http://explain.depesz.com/s/o9P :
GroupAggregate (cost=0.000..11531349.190 rows=762182 width=20)
(actual time=5307.734..5307.734 rows=0 loops=1)
Filter: (five(*) > 1)
-> Index Scan using charlie on three (cost=0.000..11422738.330
rows=7621814 width=20) (actual time=0.046..2062.952 rows=7621805
loops=1)

Browse pgsql-performance by date

	From	Date	Subject
Next Message	Jochen Erwied	2012-01-07 11:57:26	Re: Duplicate deletion optimizations
Previous Message	Strange, John W	2012-01-07 00:02:01	Re: Duplicate deletion optimizations