Re: Duplicate deletion optimizations

From: "Strange, John W" <john(dot)w(dot)strange(at)jpmorgan(dot)com>
To: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Duplicate deletion optimizations
Date: 2012-01-07 00:02:01
Message-ID: D86EB8F058615F40948C622359204B3F05460A2862@EMASC201VS01.exchad.jpmchase.net
Lists: pgsql-performance

Are your stats updated on the table after you added the index?

- run the bad query with explain verbose on (you should send this anyway)
- check to see what the difference is in expected rows vs. actual rows
- make sure that your work_mem is high enough if you are sorting; if not, you'll see it write out a temp file, which will be slow.
- if they differ, analyze the table and rerun the query to see if you get the expected results.
- I do believe having COUNT(*) > 1 will never use an index, but someone more experienced can comment here.
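
The checks above could be run roughly as follows. This is a minimal sketch, assuming the table is named stats_5mn as in the original post and a PostgreSQL version that accepts the parenthesized EXPLAIN options (9.0 or later). The work_mem value is purely illustrative, not a recommendation. Note that estimated and actual row counts only appear side by side when ANALYZE is included, which executes the query.

```sql
-- Refresh the planner statistics for the table.
ANALYZE stats_5mn;

-- Show estimated vs. actual row counts for the slow query.
-- EXPLAIN with the ANALYZE option actually runs the statement.
EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
SELECT min(id) FROM stats_5mn
GROUP BY t_value, t_record, output_id
HAVING count(*) > 1;

-- Raise work_mem for the current session only, then rerun the EXPLAIN above.
-- '256MB' is an assumed value; a "Sort Method: external merge Disk: ..." line
-- in the plan means the sort did not fit in work_mem.
SET work_mem = '256MB';
```

If the sort then fits in memory, the plan should report an in-memory sort method instead of an external merge.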

-----Original Message-----
From: pgsql-performance-owner(at)postgresql(dot)org [mailto:pgsql-performance-owner(at)postgresql(dot)org] On Behalf Of antoine(at)inaps(dot)org
Sent: Friday, January 06, 2012 8:36 AM
To: pgsql-performance(at)postgresql(dot)org
Subject: [PERFORM] Duplicate deletion optimizations

Hello,

I've a table with approximately 50 million rows with a schema like
this:

id bigint NOT NULL DEFAULT nextval('stats_5mn'::regclass),
t_value integer NOT NULL DEFAULT 0,
t_record integer NOT NULL DEFAULT 0,
output_id integer NOT NULL DEFAULT 0,
count bigint NOT NULL DEFAULT 0,
CONSTRAINT stats_mcs_5min_pkey PRIMARY KEY (id)

Every 5 minutes, a process has to insert a few thousand rows into this table, but sometimes the process has to insert an already existing row (based on the values in the triplet (t_value, t_record, output_id)). In this case, the row must be updated with the new count value. I've tried some of the solutions given in these Stack Overflow questions [1], but the insertion rate is always too low for my needs.

So, I've decided to do it in two steps:

- I insert all my new data with a COPY command
- When it's done, I run a delete query to remove the oldest duplicates

Right now, my delete query looks like this:

SELECT min(id) FROM stats_5mn
GROUP BY t_value, t_record, output_id
HAVING count(*) > 1;
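
The query above only selects the candidate ids. One way the actual deletion might be written is sketched below, under two assumptions that are not stated in the post: the row with the highest id in each (t_value, t_record, output_id) group is the newest one and should be kept, and the server supports window functions (PostgreSQL 8.4 or later). Unlike the min(id) approach, this removes every older duplicate in one pass, even when a triplet occurs more than twice.

```sql
-- Keep the newest row (highest id) per triplet, delete all older ones.
DELETE FROM stats_5mn
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               row_number() OVER (
                   PARTITION BY t_value, t_record, output_id
                   ORDER BY id DESC
               ) AS rn
        FROM stats_5mn
    ) AS ranked
    WHERE rn > 1   -- rn = 1 is the row that is kept
);
```

Running it inside a transaction and checking the reported row count before committing is a cheap safeguard on a table of this size.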

The duration of the query on my test machine with approx. 16 million rows is ~18s.

To reduce this duration, I've tried to add an index on my triplet:

CREATE INDEX test
ON stats_5mn
USING btree
(t_value , t_record , output_id );

By default, the PostgreSQL planner doesn't want to use my index and does a sequential scan [2], but if I force it with "SET enable_seqscan = off", the index is used [3] and the query duration drops to ~5s.

My questions:

- Why does the planner refuse to use my index?
- Is there a better method for my problem?

Thanks in advance for your help,
Antoine Millet.

[1]
http://stackoverflow.com/questions/1109061/insert-on-duplicate-update-postgresql

http://stackoverflow.com/questions/3464750/postgres-upsert-insert-or-update-only-if-value-is-different

[2] http://explain.depesz.com/s/UzW :
GroupAggregate (cost=1167282.380..1294947.770 rows=762182 width=20) (actual time=20067.661..20067.661 rows=0 loops=1)
  Filter: (five(*) > 1)
  -> Sort (cost=1167282.380..1186336.910 rows=7621814 width=20) (actual time=15663.549..17463.458 rows=7621805 loops=1)
        Sort Key: delta, kilo, four
        Sort Method: external merge Disk: 223512kB
        -> Seq Scan on three (cost=0.000..139734.140 rows=7621814 width=20) (actual time=0.041..2093.434 rows=7621805 loops=1)

[3] http://explain.depesz.com/s/o9P :
GroupAggregate (cost=0.000..11531349.190 rows=762182 width=20) (actual time=5307.734..5307.734 rows=0 loops=1)
  Filter: (five(*) > 1)
  -> Index Scan using charlie on three (cost=0.000..11422738.330 rows=7621814 width=20) (actual time=0.046..2062.952 rows=7621805 loops=1)

