Skip site navigation (1) Skip section navigation (2)

Re: Best way to delete unreferenced rows?

From: Ed Tyrrill <etyrrill(at)avamar(dot)com>
To: Craig James <craig_james(at)emolecules(dot)com>
Cc: "Tyrrill, Ed" <tyrrill_ed(at)emc(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Best way to delete unreferenced rows?
Date: 2007-06-12 01:09:46
Message-ID: 1181610586.27394.7.camel@nickel.avamar.com (view raw or flat)
Thread:
Lists: pgsql-performance
Craig James wrote:
> Tyrrill, Ed wrote:
> > QUERY PLAN
> >
> >
> ------------------------------------------------------------------------
> >
> ------------------------------------------------------------------------
> > -------------------
> >  Merge Left Join  (cost=38725295.93..42505394.70 rows=13799645
> width=8)
> > (actual time=6503583.342..8220629.311 rows=93524 loops=1)
> >    Merge Cond: ("outer".record_id = "inner".record_id)
> >    Filter: ("inner".record_id IS NULL)
> >    ->  Index Scan using backupobjects_pkey on backupobjects
> > (cost=0.00..521525.10 rows=13799645 width=8) (actual
> > time=15.955..357813.621 rows=13799645 loops=1)
> >    ->  Sort  (cost=38725295.93..39262641.69 rows=214938304 width=8)
> > (actual time=6503265.293..7713657.750 rows=214938308 loops=1)
> >          Sort Key: backup_location.record_id
> >          ->  Seq Scan on backup_location  (cost=0.00..3311212.04
> > rows=214938304 width=8) (actual time=11.175..1881179.825
> rows=214938308
> > loops=1)
> >  Total runtime: 8229178.269 ms
> > (8 rows)
> >
> > I ran vacuum analyze after the last time any inserts, deletes, or
> > updates were done, and before I ran the query above.  I've attached
> my
> > postgresql.conf.  The machine has 4 GB of RAM.
> 
> I thought maybe someone with more expertise than me might answer this,
> but since they haven't I'll just make a comment.  It looks to me like
> the sort of 214 million rows is what's killing you.  I suppose you
> could try to increase the sort memory, but that's a lot of memory.  It
> seems to me an index merge of a relation this large would be faster,
> but that's a topic for the experts.
> 
> On a theoretical level, the problem is that it's sorting the largest
> table.  Perhaps you could re-cast the query so that it only has to
> sort the smaller table, something like
> 
>    select a.id from a where a.id not in (select distinct b.id from b)
> 
> where "b" is the smaller table.  There's still no guarantee that it
> won't do a sort on "a", though.  In fact one of the clever things
> about Postgres is that it can convert a query like the one above into
> a regular join, unless you do something like "select ... offset 0"
> which blocks the optimizer from doing the rearrangement.
> 
> But I think the first approach is to try to tune for a better plan
> using your original query.
> 
> Craig

Thanks for the input Craig.  I actually started out with a query similar
to what you suggest, but the performance was days to complete back when
the larger table, backup_location, was still under 100 million rows.
The current query is the best performance to date.  I have been playing
around with work_mem, and doubling it to 128MB did result in some
improvement, but doubleing it again to 256MB showed no further gain.
Here is the explain analyze with work_mem increased to 128MB:

mdsdb=# explain analyze select backupobjects.record_id from
backupobjects left outer join backup_location using (record_id) where
backup_location.record_id is null;

QUERY
PLAN                                                                                                               
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Left Join  (cost=36876242.28..40658535.53 rows=13712990 width=8)
(actual time=5795768.950..5795768.950 rows=0 loops=1)
   Merge Cond: ("outer".record_id = "inner".record_id)
   Filter: ("inner".record_id IS NULL)
   ->  Index Scan using backupobjects_pkey on backupobjects
(cost=0.00..520571.89 rows=13712990 width=8) (actual
time=2.490..201516.228 rows=13706121 loops=1)
   ->  Sort  (cost=36876242.28..37414148.76 rows=215162592 width=8)
(actual time=4904205.255..5440137.309 rows=215162559 loops=1)
         Sort Key: backup_location.record_id
         ->  Seq Scan on backup_location  (cost=0.00..3314666.92
rows=215162592 width=8) (actual time=4.186..1262641.774 rows=215162559
loops=1)
 Total runtime: 5796322.535 ms



In response to

pgsql-performance by date

Next:From: Francisco ReyesDate: 2007-06-12 01:14:43
Subject: Best use of second controller with faster disks?
Previous:From: Craig JamesDate: 2007-06-11 18:26:55
Subject: Re: test / live environment, major performance difference

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group