WAL partition filling up after high WAL activity

From: Richard Yen <richyen(at)iparadigms(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: WAL partition filling up after high WAL activity
Date: 2011-11-07 22:18:46
Message-ID: CAKWMdJaEEiz1png_0groZW_9r3LbRNY8+vog+V8FyDN6meB90w@mail.gmail.com
Lists: pgsql-performance

Hi Everyone,

I recently saw a crash on one of our databases, and I was wondering whether
this might indicate that WAL is unexpectedly creating more files than it
needs to?

Nov 5 16:18:27 localhost postgres[25092]: [111-1] 2011-11-05 16:18:27.524
PDT [user=slony,db=uk dbhub.com(35180)
PID:25092 XID:2142895751]PANIC: could not write to file
"pg_xlog/xlogtemp.25092": No space left on device
Nov 5 16:18:27 localhost postgres[25092]: [111-2] 2011-11-05 16:18:27.524
PDT [user=slony,db=uk dbhub.com(35180)
PID:25092 XID:2142895751]STATEMENT: select "_sac_uk".forwardConfirm(2, 4, '
5003717188', '2011-11-05 16:18:26.977112');
Nov 5 16:18:27 localhost postgres[32121]: [7-1] 2011-11-05 16:18:27.531
PDT [user=,db= PID:32121 XID:0]LOG: server process (PID 25092) was
terminated by signal 6: Aborted
Nov 5 16:18:27 localhost postgres[32121]: [8-1] 2011-11-05 16:18:27.531
PDT [user=,db= PID:32121 XID:0]LOG: terminating any other active server
processes

If you look at this graph (http://cl.ly/2y0W27330t3o2J281H3K), the
partition actually fills up, and the logs show that postgres crashed.
After postgres crashed, it automatically restarted, cleared out its WAL
files, and began processing things again at 4:30PM.

From the graph, it looks like a vacuum on table m_dg_read finished at
4:08PM, which might explain why the downward slope levels off for a few
minutes:

> Nov 5 16:08:03 localhost postgres[18741]: [20-1] 2011-11-05 16:08:03.400
> PDT [user=,db= PID:18741 XID:0]LOG: automatic vacuum of table
> "uk.public.m_dg_read": index scans: 1
> Nov 5 16:08:03 localhost postgres[18741]: [20-2] pages: 0 removed,
> 65356 remain
> Nov 5 16:08:03 localhost postgres[18741]: [20-3] tuples: 31770
> removed, 1394263 remain
> Nov 5 16:08:03 localhost postgres[18741]: [20-4] system usage: CPU
> 2.08s/5.35u sec elapsed 619.39 sec

Looks like right afterwards, it got started on table m_object, which
finished at 4:18PM:

> Nov 5 16:18:19 localhost postgres[18686]: [9-1] 2011-11-05 16:18:19.448
> PDT [user=,db= PID:18686 XID:0]LOG: automatic vacuum of table
> "uk.public.m_object": index scans: 1
> Nov 5 16:18:19 localhost postgres[18686]: [9-2] pages: 0 removed,
> 152862 remain
> Nov 5 16:18:19 localhost postgres[18686]: [9-3] tuples: 17084
> removed, 12455761 remain
> Nov 5 16:18:19 localhost postgres[18686]: [9-4] system usage: CPU
> 4.55s/15.09u sec elapsed 1319.98 sec

It could very well be that as m_object's vacuum finished, another vacuum was
starting, and the server eventually crashed because there was no disk space
left for that vacuum to finish.

We encountered a situation like this last summer, on 7/4/2010, on a
different database cluster -- a big vacuum-for-wraparound on a 15GB table
filled the pg_xlog partition -- and that is why we started monitoring the
pg_xlog file size and the wraparound countdown. It seems like some sort of
vacuum-for-wraparound was happening at the time of this crash too, as
we also track the XID to see when we should expect a VACUUM FREEZE (
http://cl.ly/3s1S373I0l0v3E171Z0V).
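The wraparound countdown we track boils down to simple arithmetic: Postgres forces an anti-wraparound (freeze) vacuum on a table once age(relfrozenxid) exceeds autovacuum_freeze_max_age (200 million by default in 8.4). A minimal sketch of the headroom calculation, assuming the age value has already been fetched with `SELECT age(relfrozenxid) FROM pg_class`:

```python
# Sketch: transactions remaining before an anti-wraparound (freeze)
# vacuum is forced. The age value is a stand-in here; in practice it
# would come from "SELECT age(relfrozenxid) FROM pg_class WHERE ...".

AUTOVACUUM_FREEZE_MAX_AGE = 200_000_000  # 8.4 default


def xids_until_forced_freeze(relfrozenxid_age: int) -> int:
    """Transactions left before autovacuum forces a freeze vacuum."""
    return max(0, AUTOVACUUM_FREEZE_MAX_AGE - relfrozenxid_age)


# Example: a table whose relfrozenxid is 150M transactions old still
# has 50M transactions of headroom before a forced freeze vacuum.
print(xids_until_forced_freeze(150_000_000))  # 50000000
```

Alerting when this number gets small gives some warning before a large, WAL-heavy freeze vacuum kicks off on its own.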

Some configs:
checkpoint_segments=16
wal_buffers=8MB
#archive_mode=off
checkpoint_completion_target=0.9

Postgres version is 8.4.5

Note also that the pg_xlog partition is 9.7GB. No other apps run on the
machine besides pgbouncer, so it's highly unlikely that files are written
to this partition by another process. Also, our five largest tables are
the following:
gm3_load_times | 2231 MB
m_object_paper | 1692 MB
m_object | 1192 MB
m_report_stats | 911 MB
gm3_mark | 891 MB

My biggest question is: we know from the docs that there should be no more
than (2 + checkpoint_completion_target) * checkpoint_segments + 1 files.
For us, that would mean no more than 48 files, which at 16MB per segment
equates to 768MB--far lower than the 9.7GB partition size. **Why would WAL
use up so much disk space?**
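That ceiling can be checked with a few lines of arithmetic (a sketch; the 16MB segment size is the stock build default):

```python
import math

SEGMENT_MB = 16  # default WAL segment size for a stock build


def max_wal_segments(checkpoint_segments: int, completion_target: float) -> float:
    """Documented upper bound on the number of WAL segment files."""
    return (2 + completion_target) * checkpoint_segments + 1


# With checkpoint_segments=16 and checkpoint_completion_target=0.9 the
# formula gives 47.4, i.e. at most 48 whole segment files.
segments = max_wal_segments(16, 0.9)
print(math.ceil(segments) * SEGMENT_MB)  # 768 (MB), nowhere near 9.7GB
```

So even at the documented worst case, pg_xlog should stay under 1GB with these settings, which makes the full 9.7GB partition all the more surprising.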

Thanks for reading, and thanks in advance for any help you may provide.
--Richard
