WAL partition overloaded--by autovacuum?

From: Richard Yen <richyen(at)iparadigms(dot)com>
To: pgsql-performance(at)postgresql(dot)org, slony List <slony1-general(at)lists(dot)slony(dot)info>
Subject: WAL partition overloaded--by autovacuum?
Date: 2010-07-07 01:10:10
Message-ID: 03D2C5F0-F222-4088-B5E4-8DA688565B70@iparadigms.com
Lists: pgsql-performance

Hi everyone,

I'm running 8.4.2 on a CentOS machine, and postgres recently died with signal 6 (abort) because its 33GB pg_xlog partition filled up on 7/4/10 at 10:34:23 (perfect timing, as I was hiking in the mountains in the remotest parts of our country). I did some digging and found the following:

-- The current state of /db/data/pg_xlog (after postgres died on 7/4/10) shows 2056 files, all created between 7/4/10 10:01 and 7/4/10 10:34, and a quick bash-kung-fu check (sketched just below this list) shows that they were created at rates anywhere from 3 per minute (at 10:22) to 106 per minute (at 10:12)
-- wal_buffers=8MB; commit_siblings=200; commit_delay=0; checkpoint_segments=16
-- All other WAL-related parameters are either defaults or commented out
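
A sketch of that kind of per-minute count, in case anyone wants to reproduce it -- this assumes GNU stat on CentOS and that each segment's mtime still reflects when it was written out:

  # count pg_xlog segments per minute of modification time
  stat -c '%y' /db/data/pg_xlog/* | cut -c1-16 | sort | uniq -c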

-- syslog shows that on 7/2/10 from 10:16:11 to 10:21:51, messages similar to this occurred 29 times:
> Jul 2 10:16:11 db4.sac postgres[20526]: [56-1] 2010-07-02 10:16:11.117 PDT [user=,db= PID:20526 XID:0]LOG: checkpoints are occurring too frequently (4 seconds apart)
> Jul 2 10:16:11 db4.sac postgres[20526]: [56-2] 2010-07-02 10:16:11.117 PDT [user=,db= PID:20526 XID:0]HINT: Consider increasing the configuration parameter "checkpoint_segments".

-- On 7/4/10 from 09:09:02 to 09:10:08, the same type of messages occurred 21 times in syslog
-- This message did not occur on 7/3/10 at all
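
Going forward, a quick way to see whether checkpoints are being forced by WAL consumption rather than by checkpoint_timeout is pg_stat_bgwriter -- a sketch; the counters are cumulative since the last stats reset, and -U postgres is my assumption:

  # checkpoints_req  = checkpoints forced by filling checkpoint_segments (or manual CHECKPOINT)
  # checkpoints_timed = checkpoints driven by checkpoint_timeout
  psql -U postgres -d tii -c \
    "SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter;"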

-- Preceding the "checkpoints are occurring too frequently" syslog entries are autovacuum entries:
> Jul 2 10:16:04 db4.sac postgres[11357]: [7-1] 2010-07-02 10:16:04.576 PDT [user=,db= PID:11357 XID:0]LOG: automatic vacuum of table "tii._sac_global.sl_event": index scans: 1
> Jul 2 10:16:04 db4.sac postgres[11357]: [7-2] pages: 0 removed, 25 remain
> Jul 2 10:16:04 db4.sac postgres[11357]: [7-3] tuples: 676 removed, 955 remain
> Jul 2 10:16:04 db4.sac postgres[11357]: [7-4] system usage: CPU 0.01s/0.00u sec elapsed 0.02 sec
> Jul 2 10:16:04 db4.sac postgres[11357]: [8-1] 2010-07-02 10:16:04.580 PDT [user=,db= PID:11357 XID:197431667]LOG: automatic analyze of table "tii._sac_global.sl_event" system usage: CPU 0.00s/0.00u sec elapsed 0.00 sec
> Jul 2 10:16:04 db4.sac postgres[11357]: [9-1] 2010-07-02 10:16:04.965 PDT [user=,db= PID:11357 XID:0]LOG: automatic vacuum of table "tii._sac_global.sl_confirm": index scans: 1
> Jul 2 10:16:04 db4.sac postgres[11357]: [9-2] pages: 0 removed, 154 remain
> Jul 2 10:16:04 db4.sac postgres[11357]: [9-3] tuples: 1834 removed, 8385 remain
> Jul 2 10:16:04 db4.sac postgres[11357]: [9-4] system usage: CPU 0.32s/0.04u sec elapsed 0.37 sec

> Jul 4 09:08:56 db4.sac postgres[21798]: [13-1] 2010-07-04 09:08:56.107 PDT [user=,db= PID:21798 XID:0]LOG: automatic vacuum of table "tii.public.city": index scans: 0
> Jul 4 09:08:56 db4.sac postgres[21798]: [13-2] pages: 0 removed, 151 remain
> Jul 4 09:08:56 db4.sac postgres[21798]: [13-3] tuples: 0 removed, 20223 remain
> Jul 4 09:08:56 db4.sac postgres[21798]: [13-4] system usage: CPU 0.00s/0.00u sec elapsed 0.01 sec
> Jul 4 09:08:56 db4.sac postgres[21798]: [14-1] 2010-07-04 09:08:56.118 PDT [user=,db= PID:21798 XID:0]LOG: automatic vacuum of table "tii.public.country": index scans: 1
> Jul 4 09:08:56 db4.sac postgres[21798]: [14-2] pages: 0 removed, 2 remain
> Jul 4 09:08:56 db4.sac postgres[21798]: [14-3] tuples: 77 removed, 185 remain
> Jul 4 09:08:56 db4.sac postgres[21798]: [14-4] system usage: CPU 0.00s/0.00u sec elapsed 0.00 sec
> Jul 4 09:08:56 db4.sac postgres[21798]: [15-1] 2010-07-04 09:08:56.335 PDT [user=,db= PID:21798 XID:0]LOG: automatic vacuum of table "tii.public.gm3_clipboard": index scans: 1
> Jul 4 09:08:56 db4.sac postgres[21798]: [15-2] pages: 0 removed, 897 remain
> Jul 4 09:08:56 db4.sac postgres[21798]: [15-3] tuples: 2 removed, 121594 remain
> Jul 4 09:08:56 db4.sac postgres[21798]: [15-4] system usage: CPU 0.07s/0.08u sec elapsed 0.20 sec
...snip...
> Jul 4 09:10:25 db4.sac postgres[22066]: [23-1] 2010-07-04 09:10:25.921 PDT [user=,db= PID:22066 XID:0]LOG: automatic vacuum of table "tii.public.pm_assignment": index scans: 1
> Jul 4 09:10:25 db4.sac postgres[22066]: [23-2] pages: 0 removed, 995 remain
> Jul 4 09:10:25 db4.sac postgres[22066]: [23-3] tuples: 323 removed, 83964 remain
> Jul 4 09:10:25 db4.sac postgres[22066]: [23-4] system usage: CPU 0.01s/0.09u sec elapsed 0.52 sec
> Jul 4 09:10:25 db4.sac postgres[22073]: [7-1] 2010-07-04 09:10:25.978 PDT [user=,db= PID:22073 XID:0]LOG: automatic vacuum of table "tii.public.pm_question_type": index scans: 0
> Jul 4 09:10:25 db4.sac postgres[22073]: [7-2] pages: 0 removed, 1 remain
> Jul 4 09:10:25 db4.sac postgres[22073]: [7-3] tuples: 0 removed, 2 remain
> Jul 4 09:10:25 db4.sac postgres[22073]: [7-4] system usage: CPU 0.00s/0.00u sec elapsed 0.00 sec
> Jul 4 09:10:26 db4.sac postgres[22066]: [24-1] 2010-07-04 09:10:26.301 PDT [user=,db= PID:22066 XID:0]LOG: automatic vacuum of table "tii.public.pm_association_rule": index scans: 1
> Jul 4 09:10:26 db4.sac postgres[22066]: [24-2] pages: 0 removed, 286 remain
> Jul 4 09:10:26 db4.sac postgres[22066]: [24-3] tuples: 46 removed, 51321 remain
> Jul 4 09:10:26 db4.sac postgres[22066]: [24-4] system usage: CPU 0.00s/0.03u sec elapsed 0.34 sec
> Jul 4 09:10:26 db4.sac postgres[22066]: [25-1] 2010-07-04 09:10:26.328 PDT [user=,db= PID:22066 XID:0]LOG: automatic vacuum of table "tii.public.role": index scans: 0
> Jul 4 09:10:26 db4.sac postgres[22066]: [25-2] pages: 0 removed, 1 remain
> Jul 4 09:10:26 db4.sac postgres[22066]: [25-3] tuples: 0 removed, 5 remain
> Jul 4 09:10:26 db4.sac postgres[22066]: [25-4] system usage: CPU 0.00s/0.00u sec elapsed 0.00 sec
> Jul 4 09:10:26 db4.sac postgres[22066]: [26-1] 2010-07-04 09:10:26.373 PDT [user=,db= PID:22066 XID:0]LOG: automatic vacuum of table "tii.public.pm_review_type": index scans: 0
> Jul 4 09:10:26 db4.sac postgres[22066]: [26-2] pages: 0 removed, 1 remain
> Jul 4 09:10:26 db4.sac postgres[22066]: [26-3] tuples: 0 removed, 4 remain
> Jul 4 09:10:26 db4.sac postgres[22066]: [26-4] system usage: CPU 0.00s/0.00u sec elapsed 0.00 sec
> Jul 4 09:10:26 db4.sac postgres[22066]: [27-1] 2010-07-04 09:10:26.428 PDT [user=,db= PID:22066 XID:0]LOG: automatic vacuum of table "tii.public.permission": index scans: 0
> Jul 4 09:10:26 db4.sac postgres[22066]: [27-2] pages: 0 removed, 1 remain
> Jul 4 09:10:26 db4.sac postgres[22066]: [27-3] tuples: 0 removed, 8 remain
> Jul 4 09:10:26 db4.sac postgres[22066]: [27-4] system usage: CPU 0.00s/0.00u sec elapsed 0.01 sec
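
A query along these lines against pg_stat_user_tables shows which tables autovacuum has been hitting and how many dead tuples they're carrying -- a sketch, with -U postgres again being an assumption on my part:

  psql -U postgres -d tii -c "
    SELECT schemaname, relname, n_live_tup, n_dead_tup, last_autovacuum
    FROM pg_stat_user_tables
    ORDER BY last_autovacuum DESC NULLS LAST
    LIMIT 20;"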

-- log_min_duration_statement=1000, so there is a chance that a volume of really fast write traffic is being masked, but I'm doubtful, as this is our low-usage season (we run a web service for the education sector).
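
If this starts brewing again, one option would be to temporarily turn on checkpoint logging and drop the statement-logging threshold so the fast writes aren't masked (a sketch -- both settings can be changed with a reload in 8.4):

  # in postgresql.conf, temporarily:
  #   log_checkpoints = on
  #   log_min_duration_statement = 0
  pg_ctl reload -D /db/data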

-- Another tidbit: I'm using Slony 2.0.3 for replication, and my sync_interval is 500. However, I'm doubtful that Slony traffic is responsible, because my other nodes use the exact same config and hardware and none of them had this issue.
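
To help rule Slony in or out, the backlog in its queue tables could be checked the next time this happens -- a sketch, assuming the standard Slony-I 2.0 log tables and the _sac_global cluster schema that shows up in the autovacuum logs above:

  psql -U postgres -d tii -c "
    SELECT (SELECT count(*) FROM _sac_global.sl_event) AS sl_event_rows,
           (SELECT count(*) FROM _sac_global.sl_log_1) AS sl_log_1_rows,
           (SELECT count(*) FROM _sac_global.sl_log_2) AS sl_log_2_rows;"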

-- My last administrative action on this machine was to SIGHUP the postmaster on 6/29/10 to push changes to pg_hba.conf. Just now I diff-ed my postgresql.conf against the copy I keep in svn--no changes at all.

This leads me to believe that there was a sudden flurry of write activity, and that the checkpoint/bgwriter process flushing dirty buffers out to /db/data/ couldn't keep up, so old WAL segments never got recycled and the pg_xlog partition filled up. I'm wondering if anyone out there can give me some insight or comments on my assessment--is it accurate? Any input would be helpful, and I'll try to make the necessary architectural changes to keep this from happening again.
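
As a sanity check on the numbers (using the formula from the 8.4 docs, and assuming the default checkpoint_completion_target of 0.5 since I haven't changed it), pg_xlog should normally top out around (2 + checkpoint_completion_target) * checkpoint_segments + 1 segments:

  # expected ceiling with checkpoint_segments=16, checkpoint_completion_target=0.5
  echo '(2 + 0.5) * 16 + 1' | bc   # ~41 segments
  echo '41 * 16' | bc              # ~656 MB of pg_xlog expected
  # versus the 2056 segments (~33 GB) actually on disk

So something clearly kept checkpoints from completing, or kept WAL from being recycled, far beyond the normal ceiling.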

Your help is much appreciated!
--Richard
