best way to do bulk delete?

From: pg noob <pgnube(at)gmail(dot)com>
To: pgsql-novice <pgsql-novice(at)postgresql(dot)org>
Subject: best way to do bulk delete?
Date: 2011-10-28 14:30:13
Message-ID: CAPNY-2W9a1qn+hAb=x8+NHn6kRaGyUjMg1tY+4zBNTnYaVVepA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-novice

Greetings,

I have an occasional recurring use case where I need to delete a large
number of rows from multiple tables as quickly as possible,
something like a bulk-delete.

A typical example would be deleting a few hundred thousand rows at once from
a set of tables each containing 1 to 2 million rows,
but in a worst case scenario it could be as large as 1 million rows from a
set of tables each containing 4 to 5 million rows.

These tables of course have indexes on them as well as foreign key
constraints and cascade deletes to other tables.

I can't simply truncate the tables because the rows being deleted are
subsets of the total amount of data in the tables.

These tables have heavy insert/update/delete activity going on at the same
time but mostly not on the same set of rows that
is being bulk-deleted (though there may be some update activity going on
which accesses those rows and hasn't yet quiesced).

What is the best way of going about this?

I've considered a few options.

One option is to open a single transaction, issue delete statements that
delete huge numbers of rows from each
table in question (probably with a DELETE USING SQL query) and then commit
the transaction.
My concern with this approach is that it will hold a huge number of row
locks while in progress and may take a long
time to complete and could introduce deadlocks if it competes with other
updates that have acquired row locks out of order.

Another option would be to delete one row per transaction or a smaller set
of rows as part of a transaction and then
commit the transaction, repeating in a loop.
This has the advantage that if the transaction hits an error and has to be
rolled back it doesn't have to redo the entire
delete operation again, and it doesn't hold as many row locks for as long a
time. The drawback is that I believe this approach
would be a lot slower.

And finally, I've considered the idea of using COPY to copy the data that
needs to be kept to temporary tables,
truncating the original tables and then copying the data back. I believe
this would be the most efficient way to
do the delete but the implementation is quite a bit more complicated than
the first two options I described,
and has implications for how to deal with error scenarios or database/system
crashes while the operation is in progress.

Is there a better way that I haven't thought of? How would you handle this?

Thank you in advance for any ideas you might have.

Responses

Browse pgsql-novice by date

  From Date Subject
Next Message Merlin Moncure 2011-10-28 15:19:53 Re: best way to do bulk delete?
Previous Message Nuno Ferreira 2011-10-27 08:04:55 Re: Copy database to another machine