Re: Best way to delete unreferenced rows?

From: Craig James <craig_james(at)emolecules(dot)com>
To: "Tyrrill, Ed" <tyrrill_ed(at)emc(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Best way to delete unreferenced rows?
Date: 2007-06-07 22:50:45
Message-ID: 46688BC5.3020006@emolecules.com
Lists: pgsql-performance

Tyrrill, Ed wrote:

> I have a table, let's call it A, whose primary key, a_id, is referenced
> in a second table, let's call it B. For each unique A.a_id there are
> generally many rows in B with the same a_id. My problem is that I want
> to delete a row in A when the last row in B that references it is
> deleted. Right now I just query for rows in A that aren't referenced by
> B, and that worked great when the tables were small, but it takes over
> an hour now that the tables have grown larger (over 200 million rows in
> B and 14 million in A). The delete has to do a sequential scan of both
> tables since I'm looking for what's not in the indexes.
>
> I was going to try creating a trigger after delete on B for each row to
> check for more rows in B with the same a_id, and delete the row in A if
> none found. In general I will be deleting 10's of millions of rows from
> B and 100's of thousands of rows from A on a daily basis. What do you
> think? Does anyone have any other suggestions on different ways to
> approach this?

Essentially what you're doing is taking the one-hour job and spreading it out in little chunks over thousands of queries. If you have 10^7 rows in B and 10^5 rows in A, then on average you have 100 references from B to A. That means that 99% of the time, your trigger will scan B and find that there's nothing to do. This could add a lot of overhead to your ordinary transactions, costing a lot more in the long run than just doing the once-a-day big cleanout.
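
For what it's worth, the trigger you're describing would look roughly like this (I'm guessing at your table and column names, so treat it as a sketch):

    CREATE OR REPLACE FUNCTION delete_unreferenced_a() RETURNS trigger AS $$
    BEGIN
        -- After a row in B goes away, drop its parent in A if no
        -- other row in B still references it.
        DELETE FROM a
         WHERE a.a_id = OLD.a_id
           AND NOT EXISTS (SELECT 1 FROM b WHERE b.a_id = OLD.a_id);
        RETURN OLD;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER b_after_delete
        AFTER DELETE ON b
        FOR EACH ROW EXECUTE PROCEDURE delete_unreferenced_a();

With an index on b(a_id) the NOT EXISTS check is an index probe rather than a full scan, but it's still one extra probe (and possibly a delete) for every row you remove from B, which is where the per-transaction overhead comes from.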

You didn't send the specifics of the query you're using, or an EXPLAIN ANALYZE of it in operation. It may also be that your SQL is not optimal, and that somebody could suggest a more efficient query.
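
For example, one common way to write that kind of cleanup is an anti-join, something like this (again, guessing at your names):

    DELETE FROM a
    WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.a_id = a.a_id);

Depending on your PostgreSQL version and whether you wrote it with NOT IN, NOT EXISTS, or an outer join, the planner can pick very different plans, which is another reason the EXPLAIN ANALYZE output matters.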

It's also possible that it's not the sequential scans that are the problem, but rather that it just takes a long time to delete 100,000 rows from table A because you have a lot of indexes. Or it could be a combination of performance problems.

You haven't given us enough information to really analyze your problem. Send more details!

Craig
