Re: best way to do bulk delete?

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: pg noob <pgnube(at)gmail(dot)com>
Cc: pgsql-novice <pgsql-novice(at)postgresql(dot)org>
Subject: Re: best way to do bulk delete?
Date: 2011-10-28 15:19:53
Message-ID: CAHyXU0wism0HkYj9A9=9AKD=07igQ8MZ3vuvfcZ6s3L8B_fing@mail.gmail.com
Lists: pgsql-novice

On Fri, Oct 28, 2011 at 9:30 AM, pg noob <pgnube(at)gmail(dot)com> wrote:
>
> Greetings,
>
> I have an occasional recurring use case where I need to delete a large
> number of rows from multiple tables as quickly as possible,
> something like a bulk-delete.
>
> A typical example would be deleting a few hundred thousand rows at once from
> a set of tables each containing 1 to 2 million rows,
> but in a worst case scenario it could be as large as 1 million rows from a
> set of tables each containing 4 to 5 million rows.
>
> These tables of course have indexes on them as well as foreign key
> constraints and cascade deletes to other tables.
>
> I can't simply truncate the tables because the rows being deleted are
> subsets of the total amount of data in the tables.
>
> These tables have heavy insert/update/delete activity going on at the same
> time but mostly not on the same set of rows that
> is being bulk-deleted (though there may be some update activity going on
> which accesses those rows and hasn't yet quiesced).
>
> What is the best way of going about this?
>
> I've considered a few options.
>
> One option is to open a single transaction, issue delete statements that
> delete huge numbers of rows from each
> table in question (probably with a DELETE ... USING query) and then commit
> the transaction.
> My concern with this approach is that it will hold a huge number of row
> locks while in progress and may take a long
> time to complete and could introduce deadlocks if it competes with other
> updates that have acquired row locks out of order.
>
> Another option would be to delete one row per transaction or a smaller set
> of rows as part of a transaction and then
> commit the transaction, repeating in a loop.
> This has the advantage that if the transaction hits an error and has to be
> rolled back it doesn't have to redo the entire delete operation, and it
> doesn't hold as many row locks for as long.  The drawback is that I believe
> this approach would be a lot slower.
>
> And finally, I've considered the idea of using COPY to copy the data that
> needs to be kept to temporary tables,
> truncating the original tables and then copying the data back.  I believe
> this would be the most efficient way to
> do the delete but the implementation is quite a bit more complicated than
> the first two options I described,
> and has implications for how to deal with error scenarios or database/system
> crashes while the operation is in progress.
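
As an aside on option two: the batched delete is usually driven as a loop of
small statements from the client.  A minimal sketch, with a hypothetical
big_table and its id key standing in for the real tables and WHERE clause:

-- delete at most 10000 matching rows per statement; repeat from the
-- client until it reports DELETE 0
DELETE FROM big_table
WHERE id IN (SELECT id FROM big_table WHERE <condition> LIMIT 10000);

Run with autocommit, each pass commits on its own, which keeps lock
lifetimes short at the cost of rescanning for candidate rows every time.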

you don't need to COPY -- just insert/select, like this:

-- LIKE copies columns, indexes, and CHECK constraints, but not foreign keys
CREATE TABLE keeprows (LIKE old_table INCLUDING INDEXES INCLUDING CONSTRAINTS);
-- copy over only the rows you want to keep
INSERT INTO keeprows SELECT * FROM old_table WHERE ...;
-- CASCADE also drops foreign keys in other tables that reference old_table
DROP TABLE old_table CASCADE;
ALTER TABLE keeprows RENAME TO old_table;
<reset RI rules>  -- i.e., recreate the foreign keys the CASCADE removed
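
What that last step might look like, as a rough sketch -- child_table and
its columns are hypothetical stand-ins for whatever foreign keys the
CASCADE dropped:

-- recreate a hypothetical foreign key that pointed at old_table
ALTER TABLE child_table
    ADD CONSTRAINT child_table_old_table_id_fkey
    FOREIGN KEY (old_table_id) REFERENCES old_table (id)
    ON DELETE CASCADE;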

This procedure is not side-effect free and has some caveats: for example,
if you have code that depends on the table's composite type, that type is
dropped along with the table and has to be reconstructed.
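
To make that concrete, a hypothetical example -- the function and the id
column are invented for illustration:

-- a function taking the table's rowtype is dropped along with the table
-- and has to be recreated against the renamed one
CREATE FUNCTION describe(old_table) RETURNS text
    LANGUAGE sql AS $$ SELECT ($1).id::text $$;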

Perhaps a better way to go, if you can structure your code around it, is to
partition your table by when the data gets loaded and dropped -- then your
deletion becomes a plain 'DROP TABLE' with no extra processing.
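
A minimal sketch of that, using inheritance-based partitioning (the table
and column names are invented for illustration):

-- parent table plus one child per load batch
CREATE TABLE events (id bigint, loaded date, payload text);
CREATE TABLE events_2011_10 (
    CHECK (loaded >= '2011-10-01' AND loaded < '2011-11-01')
) INHERITS (events);
-- load each batch into its own child; when the batch expires, the whole
-- bulk delete collapses to:
DROP TABLE events_2011_10;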

merlin
