Re: Duplicate deletion optimizations

From: antoine(at)inaps(dot)org
To: <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Duplicate deletion optimizations
Date: 2012-01-09 12:59:59
Message-ID: 1b467beb79cae86d93ab8e55d3e79292@inaps.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hello,

Thanks for your numerous and complete answers!

For those who have asked for more information about the process and
hardware:

The goal of the process is to compute data from a nosql cluster and
write results in a PostgreSQL database. This process is triggered every
5 minutes for the latest 5 minutes data. 80% of data can be wrote in the
database with a simple copy, which is the fastest solution we found for
bulk insertion. But for some data, duplicates are possible (but very
unusual), and the new data must replace the old one in database. I'm
looking for the fastest solution to do this upsert.

About the hardware:

The PostgreSQL database run on a KVM virtual machine, configured with
8GB of ram and 4 cores of a L5640 CPU. The hypervisor have two 7,2k
standard SAS disks working in linux software raid 1. Disks are shared by
VMs, and obviously, this PostgreSQL VM doesn't share its hypervisor with
another "write-intensive" VM.

Also, this database is dedicated to store the data outgoing the
process, so I'm really free for its configuration and tuning. I also
plan to add a replicated slave database for read operations, and maybe
do a partitioning of data, if needed.

If I summarize your solutions:

- Add an "order by" statement to my initial query can help the planner
to use the index.
- Temporary tables, with a COPY of new data to the temporary table and
a merge of data (you proposed different ways for the merge).
- Use EXISTS statement in the delete (but not recommended by another
reply)

I'll try your ideas this week, and I'll give you results.

Antoine.

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message darklow 2012-01-10 11:30:53 Query planner doesn't use index scan on tsvector GIN index if LIMIT is specified
Previous Message Pierre C 2012-01-08 18:09:57 Re: Duplicate deletion optimizations