PG 7.2.4 and waiting backends

From: Rob Fielding <rob(at)dsvr(dot)net>
To: pgsql-admin(at)postgresql(dot)org
Subject: PG 7.2.4 and waiting backends
Date: 2004-03-30 11:43:24
Message-ID: 40695D5C.7090101@dsvr.net
Lists: pgsql-admin

Hi,

We're currently experiencing a problem, apparently triggered by vacuumdb
-a -f, which results in all backends (including the vacuum itself)
becoming locked with a status of 'waiting'.

The intention is to migrate the DB from 7.2 to a later version. In the
interim we have moved up to the latest 7.2.4 release in order to
circumvent a vacuumdb corner case which was causing a FULL vacuum to
fail. The upgrade to 7.2.4 resolved that, only to present a new issue:
after a significant runtime (around 300 minutes), all backends processing
selects, inserts and even the vacuum itself get stuck 'waiting'. The
result is that every backend ends up stuck and we run out of available
backends.
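
For reference, this is roughly how the stuck backends show up - a minimal
check only, assuming the cluster runs as the unix user postgres and that
ps shows the backend status line on this platform:

# count backends whose status line ends in 'waiting' (blocked on a lock)
ps auxww | grep '[p]ostgres' | grep -c waiting
# drop the -c to see what each waiting backend was running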

The reason we need to perform a full vacuum is that a problem with one
of our own subsystems handling data inserts caused the db size to rise
from 25GB to 70GB. Unfortunately the nature of the data means duplicates
are a valid case. We have run a script which hunts down and removes this
additional data, and this has completed, but pg won't free the space back
to the filesystem until a full vacuum is performed. I'd like to see where
we are up to by looking at the real space taken up by the system. After 5
hours of running vacuum full, before the freeze, the db is 68GB, so
either we still have lots of bogus dupes or we are nowhere near
completing the vacuum.
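
To keep an eye on where the space actually is, something along these
lines should do (a rough sketch only - 'mydb' and the data directory path
are placeholders, relpages is only as accurate as the last VACUUM/ANALYZE,
and the default 8KB block size is assumed):

# ten largest relations by pages, with an approximate size in MB
psql -d mydb -c "SELECT relname, relpages, relpages * 8 / 1024 AS approx_mb
                 FROM pg_class ORDER BY relpages DESC LIMIT 10;"
# plus a plain du on the data directory for the filesystem view
du -sh /path/to/data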

The system is live, so I don't have too much to go on right now, as I've
only tried the full vacuum three times due to the long runtime and the
effects being quite catastrophic if not caught at the right time. I'm at
the first stage of attempting to diagnose the problem, and I understand
the details I've provided aren't particularly scientific, but I'd like to
try to gain some further insight.

The final logs follow. Notice that prior to the problem, the same
pg_toast_64432 object is being processed. I'm uncertain of the
significance of this as I don't fully understand the log output.
Interestingly, we suddenly see a lot of log recycling right after this
point (see the checkpoint_segments setting). It is after this point that
everything stops, at which point I manually perform a fast shutdown, as
indicated in the logs after my snip. Restarting the database afterwards
is fine.

2004-03-29 13:50:58 [14147] NOTICE: --Relation pg_toast_64432--
2004-03-29 13:50:59 [14147] NOTICE: Pages 975: Changed 334, reaped
275, Empty 0, New 0; Tup 4562: Vac 0, Keep/VTL 630/0, UnUsed 338, MinLen
45, MaxLen 2034; Re-using: Free/Avail. Space 376416/372248;
EndEmpty/Avail. Pages 0/673.
CPU 0.06s/0.00u sec elapsed 1.33 sec.
2004-03-29 13:51:00 [14147] NOTICE: Index pg_toast_64432_idx: Pages
161; Tuples 4562: Deleted 0.
CPU 0.01s/0.00u sec elapsed 0.76 sec.
2004-03-29 13:51:00 [14147] NOTICE: Rel pg_toast_64432: Pages: 975 -->
974; Tuple(s) moved: 5.
CPU 0.01s/0.05u sec elapsed 0.07 sec.
2004-03-29 13:51:00 [14147] NOTICE: Index pg_toast_64432_idx: Pages
161; Tuples 4562: Deleted 5.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
2004-03-29 13:51:00 [14147] NOTICE: Analyzing mail_16
2004-03-29 13:56:42 [25752] DEBUG: recycled transaction log file
000000AB00000001
2004-03-29 13:56:42 [25752] DEBUG: recycled transaction log file
000000AB00000002
2004-03-29 13:56:42 [25752] DEBUG: recycled transaction log file
000000AB00000003
2004-03-29 13:56:42 [25752] DEBUG: recycled transaction log file
000000AB00000004
<snip total of 32 recycled logs>
2004-03-29 13:58:30 [17415] ERROR: Query was cancelled.
2004-03-29 13:59:19 [23878] DEBUG: fast shutdown request
2004-03-29 13:59:19 [23878] DEBUG: aborting any active transactions

next log:

2004-03-29 17:42:12 [26480] DEBUG: --Relation pg_toast_64432--
2004-03-29 17:42:13 [26480] DEBUG: Pages 984: Changed 10, reaped 477,
Empty 0, New 0; Tup 3977: Vac 630, Keep/VTL 0/0, UnUsed 338, MinLen 45,
MaxLen 2034; Re-using: Free/Avail. Space 1465424/1462296;
EndEmpty/Avail. Pages 0/739.
CPU 0.05s/0.00u sec elapsed 1.62 sec.
2004-03-29 17:42:14 [26480] DEBUG: Index pg_toast_64432_idx: Pages
161; Tuples 3977: Deleted 630.
CPU 0.00s/0.00u sec elapsed 0.58 sec.
2004-03-29 17:42:14 [26480] DEBUG: Rel pg_toast_64432: Pages: 984 -->
841; Tuple(s) moved: 542.
CPU 0.04s/0.13u sec elapsed 0.20 sec.
2004-03-29 17:42:14 [26480] DEBUG: Index pg_toast_64432_idx: Pages
163; Tuples 3977: Deleted 542.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
2004-03-29 17:51:44 [16371] DEBUG: recycled transaction log file
000000AD0000009E
2004-03-29 17:51:44 [16371] DEBUG: recycled transaction log file
000000AD0000009F
2004-03-29 17:51:44 [16371] DEBUG: recycled transaction log file
000000AD000000A0
<snip total of 32 recycled logs>
2004-03-29 18:32:25 [26346] DEBUG: fast shutdown request
2004-03-29 18:32:25 [26346] DEBUG: aborting any active transactions

Pertinent config:

max_fsm_relations = 1000
max_fsm_pages = 20000
vacuum_mem = 65536
effective_cache_size = 95694
random_page_cost = 2
sort_mem=65536
max_connections = 128
shared_buffers = 15732
wal_buffers = 64 # need to determine
wal_files = 64 # range 0-64
wal_sync_method = fsync # the default varies across platforms:
wal_debug = 0 # range 0-16

# hopefully this should mean fewer LogFlushes per LogInsert - uses more WAL though.
commit_delay = 10000 # range 0-100000
commit_siblings = 2 # range 1-1000

checkpoint_segments = 32 # in logfile segments (16MB each), min 1
checkpoint_timeout = 600 # in seconds, range 30-3600
fsync = false # scotty!
#fsync = true
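
As a rough sanity check on the FSM settings above (back-of-envelope only,
assuming the default 8KB block size):

# ~70GB of data / 8KB per page  ~= 9,000,000 pages in the cluster
# max_fsm_pages = 20000 remembers free space for well under 1% of those
# pages, so a plain VACUUM can't make much of the reclaimed space
# reusable - hence leaning on VACUUM FULL to hand it back to the OS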

Prior to this new incident, I'd been successfully playing around with
the WAL set-up thanks to the guys on the performance list (see the post
in performance titled "[PERFORM] WAL Optimisation - configuration and
usage"). I'm still experimenting with good settings for
checkpoint_segments and checkpoint_timeout; however, the significance of
the recycling and the checkpoint_segments setting worries me slightly. I
shall try lower values in the meantime.
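
For my own reference, the rough arithmetic on the checkpoint settings
(assuming the standard 16MB segment size, and if I've read the docs right
on the 2*checkpoint_segments+1 ceiling):

# a checkpoint is forced after 32 * 16MB = 512MB of WAL is written
# steady-state WAL on disk: normally no more than (2*32 + 1) * 16MB ~= 1GB
# vacuum full moves tuples and WAL-logs the moves, so it chews through
# segments quickly - which would explain the bursts of 32 recycled files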

Best regards,

--

Rob Fielding
rob(at)dsvr(dot)net

www.dsvr.co.uk Development Designer Servers Ltd
