Skip site navigation (1) Skip section navigation (2)

significant vacuum issues - looking for suggestions

From: Kevin Kempter <kevin(at)kevinkempterllc(dot)com>
To: "Pgsql performance" <pgsql-performance(at)postgresql(dot)org>
Subject: significant vacuum issues - looking for suggestions
Date: 2007-08-24 19:57:23
Message-ID: (view raw or flat)
Lists: pgsql-performance
Hi List;

I've just started working with a client that has been running Postgres (with 
no DBA) for a few years. They're running  version 8.1.4 on 4-way dell boxes 
with 4Gig of memory on each box attached to RAID-10 disk arrays. 

Some of their key config settings are here:
shared_buffers = 20480
work_mem = 16384
maintenance_work_mem = 32758
wal_buffers = 24
checkpoint_segments = 32
checkpoint_timeout = 300
checkpoint_warning = 30
effective_cache_size = 524288
autovacuum = on
autovacuum_naptime = 60
autovacuum_vacuum_threshold = 500
autovacuum_analyze_threshold = 250
autovacuum_vacuum_scale_factor = 0.2
autovacuum_analyze_scale_factor = 0.1
autovacuum_vacuum_cost_delay = -1
autovacuum_vacuum_cost_limit = -1

Currently I've found that they have approx 17 tables that show a significant 
amount of bloat in the system. The worst one showing over 5million  pages 
worth of dead space. One of the problems is that their system is so busy with 
activity during the day and massive data load processes at night that they 
often kill the pid of vacuum processes kicked off by autovacuum because the 
overall load impact disallows users from logging into the app since the login 
process includes at least one db query that then seems to hang because there 
are anywhere from 100 - 300 queries ahead of it at any given time. Normally a 
user gets logged in with an avg wait of 5 - 10 seconds but when a long 
running vacuum (sometimes due to a long running update process that's trying 
to sort/update > 40million rows)  is going the system gets to a state where 
the login queries never get executed until the vacuum process is killed.

As a result of this I believe that the biggest table (the one with > 5million 
pages worth of dead space) has never been vacuumed to completion. I suspect 
this is the case for a few of the other top dead space tables as well but I 
can't be sure. 

My first priority was to get this vacuum scenario cleaned up. First off I 
added the biggest table into pg_autovacuum and set the enabled column to 
false ('f'). Then I set vacuum_cost_delay to 10 and in the same session 
ran "vacuum analyze verbose big_table".  This ran for 7.5 hours before we had 
to kill it due to system load - and to make matters worse the high system 
load was forcing many of the nightly batch queries that load, update, etc the 
data to stack up to a point where the system was at less than 2% idle (CPU) 
for the next 4 hours and barely responding to the command line.

To make matters worse I find out this morning that the db is at 85% per used 
transaction ID's - again since a vacuum on the entire db has never been 

As far as I can tell, the overall db size is currently 199G of which approx 
104G seems to be valid data.

Here's my thoughts per how to proceed:

1) fix the big table ASAP (probably over the weekend) since it's not only the 
biggest table but the most active like this:

   a) run a pg_dump of this table

   b)  restore this dump into a new table (i.e. new_big_table)

   c) lock the original big_table, sync any changes, inserts, deletes since we 
did the dump from big_table into new_big_table

   d) drop big_table

   e) re-name new_big_table to big_table

* I may run through this for a few of the other large, highly active tables 
that have minimal page density as well.

The development folks that have been here awhile tell me that it seems like 
when they have a query (not limited to vacuum processes) that has been 
running for a long time (i.e. > 5 or 6 hours) that the query sort of "goes 
crazy" and the entire system gets pegged until they kill that process. - I've 
not heard of this but I suspect upgrading to 8.2.4 is probably a good plan at 
this point as well, so for step 2, I'll do this:

2) (obviously I'll do this in dev first, then in QA and finally in prod)
   a) install verson 8.2.4 from source, leaving 8.1.4 in place

   b) create the new 8.2.4 cluster on a new port

   c) setup WAL archiving on the 8.1.4 cluster

   d) do a full dump of the 8.1.4 cluster and restore it to the new 8.2.4 

   e) stop the 8.2.4 cluster and bring it up in recovery mode, pointing it to 
the directory where we're archiving the 8.1.4 cluster's WAL segments.

   f) once caught up, bring both clusters down

   g) copy any final files from the 8.1.4 cluster's pg_xlog directory into the 
new 8.2.4 pg_xlog dir (is this ok, since I'm moving 8.1.4 version tx logs 
into an 8.2.4 xlog dir?)

   h) Change the port on the 8.2.4 cluster to what the original 8.1.4 cluster 
port was

   i) bring up the new 8.2.4 system, and actively manage the vacuum needs 
moving fwd via a combination of autovacuum, cron processes for specififed 
table vac's (daily, hourly, 15min, 5min, etc), and as needed interactive 
session vacuums

The src based install will allow me to setup a robust upgrade CM process 
capable of supporting multiple concurrent versions on a server if needed, the 
ability to quickly revert to a previous version, etc however this is a 
discussion for another day - I only mention it in case the question "why not 
just use RPM's?" arises...

So here's my questions:

1) Does this sound like a good plan?

2) Are there other steps I should be taking, other Issues I should be 
concerned about short-term, etc? 

3) Does anyone have any additional advice for managing either this initial 
mess, or the system(s) long term?

Thanks in advance...



pgsql-performance by date

Next:From: Gregory StarkDate: 2007-08-24 20:06:53
Subject: Re: When/if to Reindex
Previous:From: Steven FlattDate: 2007-08-24 17:49:01
Subject: Re: When/if to Reindex

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group