
PG 7.2.4 and waiting backends

From: Rob Fielding <rob(at)dsvr(dot)net>
To: pgsql-admin(at)postgresql(dot)org
Subject: PG 7.2.4 and waiting backends
Date: 2004-03-30 11:43:24
Message-ID: 40695D5C.7090101@dsvr.net
Lists: pgsql-admin
Hi,

We are currently experiencing a problem, apparently triggered by vacuumdb -a -f,
which results in all backends (including the vacuum itself) becoming locked
with a status of 'waiting'.

The intention is to migrate the DB from 7.2 to a later version. The
interim step has been to get up to the latest 7.2.4 release in order to
circumvent a vacuumdb corner case which was causing VACUUM FULL to fail.
The upgrade to 7.2.4 resolved this, only to present a new issue: after a
significant runtime (around 300 minutes), all backends processing
selects, inserts and even the vacuum itself get stuck 'waiting'. The
result is that every backend ends up stuck and we run out of connection slots.
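
When it next wedges, I plan to grab a snapshot of what each backend thinks
it is doing before shutting down. Roughly something like this (current_query
is only populated if stats_command_string is enabled, and as far as I know
7.2 has no pg_locks view, so pg_stat_activity plus ps output is about all I
can capture):

-- show what every backend is currently running
SELECT procpid, usename, current_query
FROM pg_stat_activity
ORDER BY procpid;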

The reason we need to perform a full vacuum is that a problem with one
of our own subsystems handling data inserts caused the DB size to rise
from 25GB to 70GB. Unfortunately the nature of the data means duplicates
are a valid case. We have run a script which hunts down and removes this
additional data, and it has completed, but PostgreSQL won't free the
space back to the filesystem until a full vacuum is performed. I'd like
to see where we're up to by looking at the real space taken up by the
system. After 5 hours of running vacuum full, before the freeze, the DB
is 68GB, so either we still have lots of bogus duplicates or we are
nowhere near completing the vacuum.
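
In the meantime, to get a rough idea of where the space is going without
waiting for a full vacuum to complete, I'm thinking of just reading
relpages out of pg_class; it's only as accurate as the last vacuum/analyze
and each page is 8kB, but it should at least show the worst offenders:

-- approximate on-disk size per relation, largest first
SELECT relname, relkind, relpages, relpages * 8 / 1024 AS approx_mb
FROM pg_class
ORDER BY relpages DESC
LIMIT 20;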

The system is live, so I don't have much to go on right now; I've only
tried the full vacuum three times because of the long runtime and the
effects being quite catastrophic if not caught at the right time. I'm at
the first stage of attempting to diagnose the problem and I understand
the details I've provided aren't particularly scientific, but I'd like
to try to gain some further insight.

The final logs follow. Notice that prior to the problem, the same
pg_toast_64432 object is being processed. I'm uncertain of the
significance of this as I don't fully understand the log output.
Interestingly, we suddenly see a lot of log recycling right after this
point (see the checkpoint_segments setting). It is after this point that
everything stops, at which point I manually perform a fast shutdown, as
indicated in the logs after my snip. Restarting the database afterwards is fine.


2004-03-29 13:50:58 [14147]  NOTICE:  --Relation pg_toast_64432--
2004-03-29 13:50:59 [14147]  NOTICE:  Pages 975: Changed 334, reaped 
275, Empty 0, New 0; Tup 4562: Vac 0, Keep/VTL 630/0, UnUsed 338, MinLen 
45, MaxLen 2034; Re-using: Free/Avail. Space 376416/372248; 
EndEmpty/Avail. Pages 0/673.
         CPU 0.06s/0.00u sec elapsed 1.33 sec.
2004-03-29 13:51:00 [14147]  NOTICE:  Index pg_toast_64432_idx: Pages 
161; Tuples 4562: Deleted 0.
         CPU 0.01s/0.00u sec elapsed 0.76 sec.
2004-03-29 13:51:00 [14147]  NOTICE:  Rel pg_toast_64432: Pages: 975 --> 
974; Tuple(s) moved: 5.
         CPU 0.01s/0.05u sec elapsed 0.07 sec.
2004-03-29 13:51:00 [14147]  NOTICE:  Index pg_toast_64432_idx: Pages 
161; Tuples 4562: Deleted 5.
         CPU 0.00s/0.00u sec elapsed 0.00 sec.
2004-03-29 13:51:00 [14147]  NOTICE:  Analyzing mail_16
2004-03-29 13:56:42 [25752]  DEBUG:  recycled transaction log file 
000000AB00000001
2004-03-29 13:56:42 [25752]  DEBUG:  recycled transaction log file 
000000AB00000002
2004-03-29 13:56:42 [25752]  DEBUG:  recycled transaction log file 
000000AB00000003
2004-03-29 13:56:42 [25752]  DEBUG:  recycled transaction log file 
000000AB00000004
<snip total of 32 recycled logs>
2004-03-29 13:58:30 [17415]  ERROR:  Query was cancelled.
2004-03-29 13:59:19 [23878]  DEBUG:  fast shutdown request
2004-03-29 13:59:19 [23878]  DEBUG:  aborting any active transactions

next log:

2004-03-29 17:42:12 [26480]  DEBUG:  --Relation pg_toast_64432--
2004-03-29 17:42:13 [26480]  DEBUG:  Pages 984: Changed 10, reaped 477, 
Empty 0, New 0; Tup 3977: Vac 630, Keep/VTL 0/0, UnUsed 338, MinLen 45, 
MaxLen 2034; Re-using: Free/Avail. Space 1465424/1462296; 
EndEmpty/Avail. Pages 0/739.
         CPU 0.05s/0.00u sec elapsed 1.62 sec.
2004-03-29 17:42:14 [26480]  DEBUG:  Index pg_toast_64432_idx: Pages 
161; Tuples 3977: Deleted 630.
         CPU 0.00s/0.00u sec elapsed 0.58 sec.
2004-03-29 17:42:14 [26480]  DEBUG:  Rel pg_toast_64432: Pages: 984 --> 
841; Tuple(s) moved: 542.
         CPU 0.04s/0.13u sec elapsed 0.20 sec.
2004-03-29 17:42:14 [26480]  DEBUG:  Index pg_toast_64432_idx: Pages 
163; Tuples 3977: Deleted 542.
         CPU 0.00s/0.00u sec elapsed 0.00 sec.
2004-03-29 17:51:44 [16371]  DEBUG:  recycled transaction log file 
000000AD0000009E
2004-03-29 17:51:44 [16371]  DEBUG:  recycled transaction log file 
000000AD0000009F
2004-03-29 17:51:44 [16371]  DEBUG:  recycled transaction log file 
000000AD000000A0
<snip total of 32 recycled logs>
2004-03-29 18:32:25 [26346]  DEBUG:  fast shutdown request
2004-03-29 18:32:25 [26346]  DEBUG:  aborting any active transactions

Pertinent config:

max_fsm_relations = 1000
max_fsm_pages = 20000
vacuum_mem = 65536
effective_cache_size = 95694
random_page_cost = 2
sort_mem=65536
max_connections = 128
shared_buffers = 15732
wal_buffers = 64 # need to determine
wal_files = 64 # range 0-64
wal_sync_method = fsync   # the default varies across platforms:
wal_debug = 0             # range 0-16
 

# hopefully this should see less LogFlushes per LogInsert - use more WAL 
though.
commit_delay = 10000     # range 0-100000
commit_siblings = 2       # range 1-1000
 

checkpoint_segments = 32  # in logfile segments (16MB each), min 1
checkpoint_timeout = 600  # in seconds, range 30-3600
fsync = false # scotty!
#fsync = true

Prior to this new incident, I'd been successfully playing around with
the WAL set-up thanks to the guys on the performance list (see the post
in performance titled "[PERFORM] WAL Optimisation - configuration and
usage"). I'm still experimenting with good settings for
checkpoint_segments and checkpoint_timeout; however, the significance of
the recycling and checkpoint_segments worries me slightly. I shall try
lower values in the meantime.
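
For example, something along these lines as a first experiment, just to
see whether the burst of recycling goes away (values picked out of the
air, not tested yet):

checkpoint_segments = 8   # down from 32; smaller, more frequent checkpoints
checkpoint_timeout = 300  # in seconds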

Best regards,

-- 

Rob Fielding
rob(at)dsvr(dot)net

www.dsvr.co.uk              Development             Designer Servers Ltd
