Re: Delete performance

From: Phillip Sitbon <phillip(at)sitbon(dot)net>
To: pgsql-novice(at)postgresql(dot)org
Subject: Re: Delete performance
Date: 2009-06-15 18:34:40
Message-ID: 536685ea0906151134o53c5627ep36167f0511c08953@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-novice

I've had to do this quite a bit, and here's how I usually go about it:

DELETE FROM data_structures_items WHERE NOT EXISTS
(SELECT 1 FROM data_structures WHERE id_structure=id_data_structure LIMIT 1);

Even when the item in the subquery is a primary key, I find it
semantically informative to always use "LIMIT 1".

Looking closer at your structure, this looks a lot like a problem I
have had in the past with almost the exact same scenario. What I
ended up doing beyond the above suggested query was to create a
temporary table. Since you're lucky enough to be working with a
primary key in the large table (I wasn't), you can copy all of the IDs
into temp table. Then delete from the temp table anything that appears
in your data_structures IDs. I'm not sure if it's necessary, but you
could also create an index on the temp table.

So your query becomes:

DELETE FROM data_structures_items WHERE EXISTS
(SELECT 1 FROM tmp_not_in_data_structures WHERE
id_structure=id_data_structure LIMIT 1);

This should help if the set of IDs to delete is smaller than the
entire set of IDs. It works for me, but I can't be sure it'll help you
- my main goal was to minimize the condition checking overhead for
each row of the big table. Also, I'm not sure if it's
valid/possible/better, but instead of EXISTS, "IS NOT NULL" might work
here as well.

If your indexes are rather large (and therefore may not fit into
memory), you might consider partitioning. Along with the above
solution, I use partitions and operate on them in parallel from a
Python client.

Cheers,

Phillip

6/15 Jana <jana(dot)vasseru(at)gmail(dot)com>:
> On Sun, 14 Jun 2009 18:12:50 +0200, Frank Bax <fbax(at)sympatico(dot)ca> wrote:
>
>> Jana wrote:
>>>
>>>  Hello,
>>> i have a table with about 250m records from which i want to delete thoose
>>> not contained in other table. I used this SQL query:
>>> DELETE FROM data_structures_items WHERE id_data_structure NOT IN (
>>> SELECT id_structure FROM data_structures);
>>
>>
>> DELETE FROM data_structures_items, data_structures WHERE
>> data_structures_items.id_data_structure = data_structures.id_data_structure
>> AND data_structures_items.id_data_structure IS NULL;
>>
>
> Thanks for answer, this however is not a valid syntax (at least according to
> manual, and my 8.3)
> version. What could be done is
>
> DELETE FROM data_structures_items USING data_structures
> WHERE
> data_structures_items.id_data_structure=data_structures.id_structure
> AND something
>
> but the problem is in that "something". I cannot write
> data_structures_items.id_data_structure =
> data_structures.id_data_structure AND
> data_structures_items.id_data_structure IS NULL;
>
> because it a) doesn't make sense (column can't be null and equal to
> something at the same time), b) doesn't select what i want (rows whoose
> id_data_structure is NOT in the data_structures table, i'm pretty sure it
> is a number ). I can't join tables in DELETE command, and i can't think of a
> way doing it with WHERE
>
> Regards,
>  Jana
>
> --
> Sent via pgsql-novice mailing list (pgsql-novice(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-novice
>

In response to

Responses

Browse pgsql-novice by date

  From Date Subject
Next Message Tom Lane 2009-06-15 18:46:52 Re: Delete performance
Previous Message Jana 2009-06-15 13:33:45 Re: Delete performance