Re: BUG #2104: pg_xlog/ trace files not reclaimed by server

From: Reuben Pasquini <pasquini(at)Imageworks(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org, pasquini(at)imageworks(dot)com
Subject: Re: BUG #2104: pg_xlog/ trace files not reclaimed by server
Date: 2005-12-09 19:16:06
Message-ID: 4399D7F6.9070003@Imageworks.com
Lists: pgsql-bugs

Hi Tom,

Unfortunately, I had to rebuild the database pretty quickly
to get the app it supports back up,
and I wiped out the log files.
The postmaster would dump core on startup - so I
just wiped the database, and recreated the schema
(my app can rebuild its state).
I didn't save the core file either - but it wasn't
very informative - something like:

abort ();
failedToStartupChildProcesses ();
...

I'll save the logs and core if this happens again.
I may have misdiagnosed the cause of the problem,
but the pg_xlog/ directory does fill up with files until
the disk is full - 8 GB worth - when it enters this unstable
state. I thought it might be related to exceeding
2*checkpoint_segments + 1
because of what I read here:

http://www.postgresql.org/docs/8.1/static/wal-configuration.html

-------------

There will be at least one WAL segment file, and will normally not be
more than 2 * checkpoint_segments + 1 files. Each segment file is
normally 16 MB (though this size can be altered when building the
server). You can use this to estimate space requirements for WAL.
Ordinarily, when old log segment files are no longer needed, they are
recycled (renamed to become the next segments in the numbered sequence).
If, due to a short-term peak of log output rate, there are more than 2 *
checkpoint_segments + 1 segment files, the unneeded segment files will
be deleted instead of recycled until the system gets back under this limit.

-------------

The database ran fine for the last 10 days - it seemed to recycle the
trace files under pg_xlog/ fine. I was monitoring the
pg_xlog/ directory, because I knew that was what grew out of control
before the last crash. I noticed at the end of the day yesterday that
we had grown to 25 trace files, and I knew my checkpoint_segments
setting was 12.
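
As a rough check of the numbers above, here is a minimal sketch of the arithmetic from the quoted documentation. It assumes the default 16 MB segment size; the function name wal_ceiling is made up for illustration and is not part of PostgreSQL.

```python
# Assumption: default WAL segment size of 16 MB, as stated in the 8.1 docs.
SEGMENT_BYTES = 16 * 1024 * 1024


def wal_ceiling(checkpoint_segments):
    """Return (files, bytes) for the documented normal upper bound,
    2 * checkpoint_segments + 1 segment files."""
    files = 2 * checkpoint_segments + 1
    return files, files * SEGMENT_BYTES


# The two settings mentioned in this message.
for setting in (12, 30):
    files, size = wal_ceiling(setting)
    print(f"checkpoint_segments={setting}: {files} files, {size // 2**20} MB")
```

With checkpoint_segments at 12 the bound works out to 25 files (400 MB), so 25 files is at the documented limit rather than over it; with 30 the bound is 61 files (976 MB).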

Anyway - I could be completely off here.
I upped my checkpoint_segments setting to 30 (still have a 5-minute timeout),
and trace files are recycling ok:

$ ls -lrt pg_xlog/
total 278872
drwx------ 2 monitor user 4096 Dec 9 04:59 archive_status
-rw------- 1 monitor user 16777216 Dec 9 11:00 0000000100000001000000C9
-rw------- 1 monitor user 16777216 Dec 9 11:00 0000000100000001000000C8
-rw------- 1 monitor user 16777216 Dec 9 11:02 0000000100000001000000C7
-rw------- 1 monitor user 16777216 Dec 9 11:05 0000000100000001000000B9
-rw------- 1 monitor user 16777216 Dec 9 11:05 0000000100000001000000BA
-rw------- 1 monitor user 16777216 Dec 9 11:06 0000000100000001000000BB
-rw------- 1 monitor user 16777216 Dec 9 11:07 0000000100000001000000BC
-rw------- 1 monitor user 16777216 Dec 9 11:07 0000000100000001000000BD
-rw------- 1 monitor user 16777216 Dec 9 11:08 0000000100000001000000BE
-rw------- 1 monitor user 16777216 Dec 9 11:09 0000000100000001000000BF
-rw------- 1 monitor user 16777216 Dec 9 11:09 0000000100000001000000C0
-rw------- 1 monitor user 16777216 Dec 9 11:10 0000000100000001000000C1
-rw------- 1 monitor user 16777216 Dec 9 11:11 0000000100000001000000C2
-rw------- 1 monitor user 16777216 Dec 9 11:11 0000000100000001000000C3
-rw------- 1 monitor user 16777216 Dec 9 11:12 0000000100000001000000C4
-rw------- 1 monitor user 16777216 Dec 9 11:13 0000000100000001000000C5
-rw------- 1 monitor user 16777216 Dec 9 11:13 0000000100000001000000C6
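
A check like the one done by eye on this listing could be scripted roughly as follows. This is only a sketch: the helper names are hypothetical, and it assumes that every entry whose name is exactly 24 hexadecimal digits is a WAL segment, as in the listing above.

```python
import re

# Segment file names in the listing are 24 hexadecimal digits;
# other entries such as archive_status are ignored.
SEGMENT_NAME = re.compile(r"^[0-9A-F]{24}$")


def count_segments(names):
    """Count the entries that look like WAL segment files."""
    return sum(1 for name in names if SEGMENT_NAME.match(name))


def check_xlog(names, checkpoint_segments):
    """Return (count, limit, over_limit) against 2 * checkpoint_segments + 1."""
    limit = 2 * checkpoint_segments + 1
    count = count_segments(names)
    return count, limit, count > limit


# A few names taken from the listing above, plus the subdirectory.
sample = [
    "archive_status",
    "0000000100000001000000C9",
    "0000000100000001000000C8",
    "0000000100000001000000B9",
]
print(check_xlog(sample, 30))  # (3, 61, False)
```

To run it against a live server, pass os.listdir() of the pg_xlog/ directory as names and the configured checkpoint_segments value.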

We'll see how that goes. I'll let you know if I get another crash.

Cheers,
Reuben

Tom Lane wrote:

>"Reuben Pasquini" <pasquini(at)imageworks(dot)com> writes:
>
>
>>It appears that when my 8.1 database is forced
>>to generate more than
>> 2*checkpoint_segments + 1
>>trace files under pg_xlog/, that the database
>>becomes confused and stops recycling the trace files.
>>
>>
>
>That's fairly hard to believe, especially since you haven't presented
>any actual evidence of it.
>
>What might have happened is that checkpoints were failing for some
>reason and so recycling of WAL segments couldn't be performed. Was
>there anything in the postmaster log about write failures?
>
> regards, tom lane
>
>
>
