Re: Best way to delete unreferenced rows?

From: "Tyrrill, Ed" <tyrrill_ed(at)emc(dot)com>
To: "Craig James" <craig_james(at)emolecules(dot)com>
Cc: <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Best way to delete unreferenced rows?
Date: 2007-06-08 16:45:37
Message-ID: A23190A408F7094FAF446C1538222F760409337E@avaexch01.avamar.com
Lists: pgsql-performance

Craig James wrote:
> Tyrrill, Ed wrote:
>
>> I have a table, let's call it A, whose primary key, a_id, is
>> referenced in a second table, let's call it B. For each unique A.a_id
>> there are generally many rows in B with the same a_id. My problem is
>> that I want to delete a row in A when the last row in B that
>> references it is deleted. Right now I just query for rows in A that
>> aren't referenced by B, and that worked great when the tables were
>> small, but it takes over an hour now that the tables have grown
>> larger (over 200 million rows in B and 14 million in A). The delete
>> has to do a sequential scan of both tables since I'm looking for
>> what's not in the indexes.
>>
>> I was going to try creating a trigger after delete on B for each row
>> to check for more rows in B with the same a_id, and delete the row in
>> A if none found. In general I will be deleting 10's of millions of
>> rows from B and 100's of thousands of rows from A on a daily basis.
>> What do you think? Does anyone have any other suggestions on
>> different ways to approach this?
>
> Essentially what you're doing is taking the one-hour job and spreading
> it out in little chunks over thousands of queries. If you have 10^7
> rows in B and 10^5 rows in A, then on average you have 100 references
> from B to A. That means that 99% of the time, your trigger will scan B
> and find that there's nothing to do. This could add a lot of overhead
> to your ordinary transactions, costing a lot more in the long run than
> just doing the once-a-day big cleanout.
>
> You didn't send the specifics of the query you're using, along with an
> EXPLAIN ANALYZE of it in operation. It may also be that your SQL is
> not optimal, and that somebody could suggest a more efficient query.
>
> It's also possible that it's not the sequential scans that are the
> problem, but rather that it just takes a long time to delete 100,000
> rows from table A because you have a lot of indexes. Or it could be
> a combination of performance problems.
>
> You haven't given us enough information to really analyze your
> problem. Send more details!
>
> Craig

Ok. Yes, there are a bunch of indexes on A that may slow down the
delete, but if I run just the select part of the delete statement
through explain analyze, the select accounts for the majority of the
time. The complete SQL statement for the delete is:

delete from backupobjects where record_id in
  (select backupobjects.record_id from backupobjects
   left outer join backup_location using (record_id)
   where backup_location.record_id is null);

What I've referred to as A is backupobjects, and B is backup_location.
Here is explain analyze of just the select:

mdsdb=# explain analyze select backupobjects.record_id from backupobjects
        left outer join backup_location using (record_id)
        where backup_location.record_id is null;

                                 QUERY PLAN
--------------------------------------------------------------------------------
 Merge Left Join  (cost=38725295.93..42505394.70 rows=13799645 width=8) (actual time=6503583.342..8220629.311 rows=93524 loops=1)
   Merge Cond: ("outer".record_id = "inner".record_id)
   Filter: ("inner".record_id IS NULL)
   ->  Index Scan using backupobjects_pkey on backupobjects  (cost=0.00..521525.10 rows=13799645 width=8) (actual time=15.955..357813.621 rows=13799645 loops=1)
   ->  Sort  (cost=38725295.93..39262641.69 rows=214938304 width=8) (actual time=6503265.293..7713657.750 rows=214938308 loops=1)
         Sort Key: backup_location.record_id
         ->  Seq Scan on backup_location  (cost=0.00..3311212.04 rows=214938304 width=8) (actual time=11.175..1881179.825 rows=214938308 loops=1)
 Total runtime: 8229178.269 ms
(8 rows)

I ran vacuum analyze after the last time any inserts, deletes, or
updates were done, and before I ran the query above. I've attached my
postgresql.conf. The machine has 4 GB of RAM.
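
For reference, the trigger I was considering would be roughly along these
lines (just an untested sketch; the function and trigger names are
placeholders I made up):

-- untested sketch of the after-delete trigger idea described above
create or replace function delete_unreferenced_backupobject() returns trigger as $$
begin
    -- after a row is removed from backup_location, delete the matching
    -- backupobjects row if no other backup_location rows still reference it
    if not exists (select 1 from backup_location
                   where record_id = old.record_id) then
        delete from backupobjects where record_id = old.record_id;
    end if;
    return old;
end;
$$ language plpgsql;

create trigger backup_location_cleanup
    after delete on backup_location
    for each row execute procedure delete_unreferenced_backupobject();

As you point out, this adds a per-row check to every delete on
backup_location, so it may well cost more overall than the once-a-day
cleanout.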

Thanks,
Ed

Attachment         Content-Type                Size
postgresql.conf    application/octet-stream    13.5 KB
