PG 7.1.2 Crash: cannot read xlog dir

From: kay <efesar(at)nmia(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Subject: PG 7.1.2 Crash: cannot read xlog dir
Date: 2003-05-31 21:29:34
Message-ID: NGBBKFMOILMAGDABPFEGEEADENAA.efesar@nmia.com
Lists: pgsql-admin


My situation: PG 7.1.2, Red Hat 7.2, running in a chroot jail on a "VDS"
server at my new ISP. I can't recompile anything, can't upgrade PG
(basically, I'm stuck with 7.1.2).

This issue was previously noted in a thread in late 2002. The actual thread
in which Tom Lane suggests it might be a permissions issue is missing from
the archive, but I found it in Google's cache (for two Webcrawler docs:
http://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=%22cannot+read+xlog+dir%22+&btnG=Google+Search
). As to why they aren't on archives.postgresql.org ... ya got me.

I changed permissions to the most permissive setting I know (0777), plus I
own the directory, I own the files, and I own the postmaster process, so the
only thing I can think of is that 'readdir' is badly linked or has some freaky
kernel interaction. I have Python, perl and PHP on the system, and they all
use 'opendir' and 'readdir' and 'closedir' just fine on the pg_xlog
directory.

My problem: I've deduced that the 'readdir' call is broken in my PG. I
examined the source code for 7.1 very thoroughly (
http://developer.postgresql.org/cvsweb.cgi/pgsql-server/src/backend/access/transam/xlog.c?rev=1.65.2.1&content-type=text/x-cvsweb-markup&only_with_tag=REL7_1_STABLE
; see MoveOfflineLogs). What I've found is that 'opendir' seems to open the
directory fine (it does not return NULL), but when 'readdir' tries to grab
a filename, something fails with the file system error 'No such file or
directory': it returns NULL and 'errno' gets set. The strange thing is that
it gets in there ONCE and handles ONE file (0000000000000000), and then it
won't do any more, ever again, until I stop the server and run initdb again.

At this point I know that there's nothing wrong with the XLOG directory or
the files in it, because PG has been writing transactions fine for 7-8 hours
up to this point. It can only be a bad 'readdir' call.

My question: Is there some runtime setting I can use to prevent
MoveOfflineLogs() from ever being called? I would MUCH rather have a couple
of old XLOGs lying around than a fatal crash. Maybe by CHECKPOINTing every
hour or something ... I've tried playing with a bunch of different WAL
settings and ... I can't stop MoveOfflineLogs from being called.

Please keep in mind my hands are tied, and I can't recompile and I can't
upgrade. Even if I could upgrade, I imagine that 'readdir' would still be
broken, and I'd still have this issue.

If anybody can think of a workaround I'd really appreciate it. I've been
racking my brain on this for a week.

Thanks

-Keith

==================

Here's the log.

/usr/local/pgsql/bin/postmaster: reaping dead processes...
/usr/local/pgsql/bin/postmaster: CleanupProc: pid 24626 exited with status 0
XLogFlush: rqst 0/12259528; wrt 0/0; flsh 0/0
XLogFlush: rqst 0/17078212; wrt 0/17078248; flsh 0/17078248
XLogFlush: rqst 0/17078152; wrt 0/17078248; flsh 0/17078248
XLogFlush: rqst 0/0; wrt 0/17078248; flsh 0/17078248
INSERT @ 0/17078248: prev 0/17078212; xprev 0/0; xid 0: XLOG - checkpoint: redo 0/17078248; undo 0/0; sui 28; xid 3495; oid 36603; online
XLogFlush: rqst 0/17078312; wrt 0/17078248; flsh 0/17078248
DEBUG: MoveOfflineLogs: remove 0000000000000000
FATAL 2: MoveOfflineLogs: cannot read xlog dir: No such file or directory
DEBUG: proc_exit(2)
DEBUG: shmem_exit(2)
DEBUG: exit(2)
/usr/local/pgsql/bin/postmaster: reaping dead processes...
/usr/local/pgsql/bin/postmaster: CleanupProc: pid 24736 exited with status 512
Server process (pid 24736) exited with status 512 at Sat May 31 09:57:57 2003
Terminating any active server processes...
Server processes were terminated at Sat May 31 09:57:57 2003
Reinitializing shared memory and semaphores
invoking IpcMemoryCreate(size=1236992)
DEBUG: database system was interrupted at 2003-05-31 09:57:57 EDT
DEBUG: CheckPoint record at (0, 17078248)
DEBUG: Redo record at (0, 17078248); Undo record at (0, 0); Shutdown FALSE
DEBUG: NextTransactionId: 3495; NextOid: 36603
DEBUG: database system was not properly shut down; automatic recovery in progress...
DEBUG: ReadRecord: record with zero len at (0, 17078312)
DEBUG: redo is not required
INSERT @ 0/17078312: prev 0/17078248; xprev 0/0; xid 0: XLOG - checkpoint: redo 0/17078312; undo 0/0; sui 28; xid 3495; oid 36603; shutdown
XLogFlush: rqst 0/17078376; wrt 0/17078312; flsh 0/17078312
FATAL 2: MoveOfflineLogs: cannot read xlog dir: No such file or directory
DEBUG: proc_exit(2)
DEBUG: shmem_exit(2)
DEBUG: exit(2)

=========================

Here's the code from 7.1.

static void
MoveOfflineLogs(uint32 log, uint32 seg)
{
    DIR        *xldir;
    struct dirent *xlde;
    char        lastoff[32];
    char        path[MAXPGPATH];

    Assert(XLOG_archive_dir[0] == 0);   /* ! implemented yet */

    xldir = opendir(XLogDir);
    if (xldir == NULL)
        elog(STOP, "MoveOfflineLogs: cannot open xlog dir: %m");

    sprintf(lastoff, "%08X%08X", log, seg);

    errno = 0;
    while ((xlde = readdir(xldir)) != NULL)
    {
        if (strlen(xlde->d_name) == 16 &&
            strspn(xlde->d_name, "0123456789ABCDEF") == 16 &&
            strcmp(xlde->d_name, lastoff) <= 0)
        {
            elog(LOG, "MoveOfflineLogs: %s %s",
                 (XLOG_archive_dir[0]) ?
                 "archive" : "remove", xlde->d_name);
            sprintf(path, "%s%c%s", XLogDir, SEP_CHAR,
                    xlde->d_name);
            if (XLOG_archive_dir[0] == 0)
                unlink(path);
        }
        errno = 0;
    }
    if (errno)
        elog(STOP, "MoveOfflineLogs: cannot read xlog dir: %m");
    closedir(xldir);
}
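
To make the filter in that loop easier to reason about, here is a minimal sketch that pulls the three-part condition out into a standalone function. `segment_is_offline` is a hypothetical name, and `unsigned int` stands in for PostgreSQL's `uint32` (an assumption that holds on common platforms). It only shows which file names the loop would select for a given (log, seg) pair; it does not touch the file system.

```c
#include <stdio.h>
#include <string.h>

/*
 * Returns 1 if name would pass the filter used in MoveOfflineLogs for
 * the given (log, seg) pair, else 0.
 */
int
segment_is_offline(const char *name, unsigned int log, unsigned int seg)
{
    char lastoff[32];

    /* Same formatting as the original: two 8-digit uppercase hex fields. */
    snprintf(lastoff, sizeof(lastoff), "%08X%08X", log, seg);

    /* Name must be exactly 16 uppercase hex digits and sort at or
     * before lastoff in plain string order. */
    return strlen(name) == 16 &&
           strspn(name, "0123456789ABCDEF") == 16 &&
           strcmp(name, lastoff) <= 0;
}
```

For example, with log 0 and seg 0 the only name that passes is 0000000000000000, which is the one file the log shows being removed.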
