Skip site navigation (1) Skip section navigation (2)

Re: Best way to delete unreferenced rows?

From: "Tyrrill, Ed" <tyrrill_ed(at)emc(dot)com>
To: "Craig James" <craig_james(at)emolecules(dot)com>
Cc: <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Best way to delete unreferenced rows?
Date: 2007-06-08 16:45:37
Message-ID: A23190A408F7094FAF446C1538222F760409337E@avaexch01.avamar.com (view raw or flat)
Thread:
Lists: pgsql-performance
Craig James wrote:
> Tyrrill, Ed wrote:
>
>> I have a table, let's call it A, whose primary key, a_id, is
referenced
>> in a second table, let's call it B.  For each unique A.a_id there are
>> generally many rows in B with the same a_id.  My problem is that I
want
>> to delete a row in A when the last row in B that references it is
>> deleted.  Right now I just query for rows in A that aren't referenced
by
>> B, and that worked great when the tables were small, but it takes
over
>> an hour now that the tables have grown larger (over 200 million rows
in
>> B and 14 million in A).  The delete has to do a sequential scan of
both
>> tables since I'm looking for what's not in the indexes.
>> 
>> I was going to try creating a trigger after delete on B for each row
to
>> check for more rows in B with the same a_id, and delete the row in A
if
>> none found.  In general I will be deleting 10's of millions of rows
from
>> B and 100's of thousands of rows from A on a daily basis.  What do
you
>> think?  Does anyone have any other suggestions on different ways to
>> approach this?
>
> Essentially what you're doing is taking the one-hour job and spreading
> out in little chunks over thousands of queries.  If you have 10^7 rows
> in B and 10^5 rows in A, then on average you have 100 references from
B
> to A.  That means that 99% of the time, your trigger will scan B and
find
> that there's nothing to do.  This could add a lot of overhead to your
> ordinary transactions, costing a lot more in the long run than just
doing
> the once-a-day big cleanout.
>
> You didn't send the specifics of the query you're using, along with an
> EXPLAIN ANALYZE of it in operation.  It also be that your SQL is not
> optimal, and that somebody could suggest a more efficient query.
>
> It's also possible that it's not the sequential scans that are the
> problem, but rather that it just takes a long time to delete 100,000
> rows from table A because you have a lot of indexes. Or it could be
> a combination of performance problems.
>
> You haven't given us enough information to really analyze your
problem.
> Send more details!
>
> Craig

Ok.  Yes, there are a bunch of indexes on A that may slow down the
delete, but if I just run the select part of the delete statement
through explain analyze then that is the majority of the time.  The
complete sql statement for the delete is:

delete from backupobjects where record_id in (select
backupobjects.record_id from backupobjects left outer join
backup_location using(record_id) where backup_location.record_id is null
)

What I've referred to as A is backupobjects, and B is backup_location.
Here is explain analyze of just the select:

mdsdb=# explain analyze select backupobjects.record_id from
backupobjects left outer join backup_location using(record_id) where
backup_location.record_id is null;
 
QUERY PLAN

------------------------------------------------------------------------
------------------------------------------------------------------------
-------------------
 Merge Left Join  (cost=38725295.93..42505394.70 rows=13799645 width=8)
(actual time=6503583.342..8220629.311 rows=93524 loops=1)
   Merge Cond: ("outer".record_id = "inner".record_id)
   Filter: ("inner".record_id IS NULL)
   ->  Index Scan using backupobjects_pkey on backupobjects
(cost=0.00..521525.10 rows=13799645 width=8) (actual
time=15.955..357813.621 rows=13799645 loops=1)
   ->  Sort  (cost=38725295.93..39262641.69 rows=214938304 width=8)
(actual time=6503265.293..7713657.750 rows=214938308 loops=1)
         Sort Key: backup_location.record_id
         ->  Seq Scan on backup_location  (cost=0.00..3311212.04
rows=214938304 width=8) (actual time=11.175..1881179.825 rows=214938308
loops=1)
 Total runtime: 8229178.269 ms
(8 rows)

I ran vacuum analyze after the last time any inserts, deletes, or
updates were done, and before I ran the query above.  I've attached my
postgresql.conf.  The machine has 4 GB of RAM.

Thanks,
Ed

Attachment: postgresql.conf
Description: application/octet-stream (13.5 KB)

In response to

Responses

pgsql-performance by date

Next:From: Guy RouillierDate: 2007-06-08 16:46:05
Subject: Re: How much ram is too much
Previous:From: Dave CramerDate: 2007-06-08 16:31:47
Subject: How much ram is too much

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group