Re: Very long deletion time on a 200 GB database

From: "Reuven M(dot) Lerner" <reuven(at)lerner(dot)co(dot)il>
To: sthomas(at)peak6(dot)com
Cc: Glyn Astill <glynastill(at)yahoo(dot)co(dot)uk>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Very long deletion time on a 200 GB database
Date: 2012-02-26 10:46:28
Message-ID: 4F4A0D84.2070108@lerner.co.il
Lists: pgsql-performance

Hi again, everyone.

Wow, I can't get over how helpful everyone has been.

Shaun wrote:

> The main problem you're going to run into is that your table is larger than the memory in that server. 4GB is really pretty small for a server hosting a 200+GB database. That they didn't mean it to get that big doesn't really help you clean it up.

Yep! And as you pointed out later in your note, PostgreSQL isn't the
only thing running on this computer. There's also a full-fledged
Windows application normally running on it. And the nature of the
manufacturing, black-box context means that maintenance is supposed to
be rare, and that anything which gets us off of a 24/7 work schedule is
enormously expensive.

This has been a fun problem to fix, for sure... We're not there yet,
but I feel like we're really close.

I'm currently trying a hybrid approach, based on several suggestions
that were posted to this list:

Given that nothing else should be running during this maintenance
operation, I'm going to bump up shared_buffers. Even after we run our
maintenance, a shared_buffers setting that ridiculously low can't be
helping anything, so I'll push it up.
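Concretely, something like this in postgresql.conf (the exact value is
just a guess for this 4GB Windows box, and it needs a server restart to
take effect):

    shared_buffers = 512MB        # up from the tiny current setting
    maintenance_work_mem = 256MB  # should also help if we VACUUM/ANALYZE afterward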

I finally remembered why I had such a tortured set of subselects in my
original query: If you're going to do a query with LIMIT in it, you had
better be sure that you know what you're doing, because without an ORDER
BY clause, you might be in for surprises. And sure enough, in our
testing, I realized that when we asked the database for up to 5 rows, we
were getting the same rows again and again, so the process stopped
making progress after deleting just a few batches of rows.
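In case it's useful to anyone else, the gotcha looks roughly like this
(table and column names here are made up):

    -- Without an ORDER BY, there's no guarantee which 5 rows come back,
    -- and in practice we kept getting the same ones on every pass:
    SELECT id FROM r
     WHERE created_at < now() - interval '30 days'
     LIMIT 5;

    -- An ORDER BY makes each batch deterministic (at the cost of a sort):
    SELECT id FROM r
     WHERE created_at < now() - interval '30 days'
     ORDER BY id
     LIMIT 5;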

So I changed tactics somewhat, and it appears to be working much, much
faster: I first created a table (not a temp table, simply because my
functions are getting invoked by the .NET application in a new
connection each time, and I obviously don't want my table to go away)
containing the IDs of the rows in the R table that are more than n days
old. This table has about 200,000 rows in it, but each row is just a
single int, so it's pretty small.
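In rough terms, it's built like this (the real table and column names
differ, and the cutoff is a parameter rather than a hard-coded 30 days):

    CREATE TABLE old_report_ids AS
        SELECT id
          FROM r
         WHERE created_at < now() - interval '30 days';

    -- an index keeps the per-chunk lookups and deletes cheap
    CREATE INDEX old_report_ids_id_idx ON old_report_ids (id);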

I then have a separate function that takes a parameter, the chunk size.
I loop through the table created in the first function
(old_report_ids), deleting all of the records in the B table that
reference that row in the R table. I then remove the row from the old_report_ids
table, and then loop again, until I've reached the chunk size. There
are undoubtedly more elegant ways to do this, but we just gotta get it
working at this point. :-)
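For the curious, the second function looks more or less like this
(simplified, with made-up table and column names; the real thing is
called from the .NET side with the chunk size as its argument):

    CREATE OR REPLACE FUNCTION delete_old_report_chunk(chunk_size integer)
    RETURNS integer AS $$
    DECLARE
        rid      integer;
        how_many integer := 0;
    BEGIN
        FOR rid IN SELECT id FROM old_report_ids LIMIT chunk_size LOOP
            DELETE FROM b WHERE report_id = rid;        -- the B rows that reference this R row
            DELETE FROM old_report_ids WHERE id = rid;  -- done with this report, don't revisit it
            how_many := how_many + 1;
        END LOOP;
        RETURN how_many;  -- 0 means there's nothing left to clean up
    END;
    $$ LANGUAGE plpgsql;

The .NET side then just calls it in a loop, something like
SELECT delete_old_report_chunk(5000), until it returns 0.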

We're about to test this, but from my small tests on my computer, it ran
much, much faster than other options. We'll see what happens when we
try it now on the 200 GB monster...

Reuven
