Re: PANIC: rename from /data/pg_xlog/0000002200000009

From: "Yurgis Baykshtis" <ybaykshtis(at)micropat(dot)com>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PANIC: rename from /data/pg_xlog/0000002200000009
Date: 2003-11-25 23:41:29
Message-ID: 001b01c3b3ad$a700afa0$a5936e3f@aurigin.com
Lists: pgsql-hackers

> I get the feeling that what we will see is the destination
> filename already present and the source not, which would suggest
> that two backends tried to do the rename concurrently.

Tom,

I just noticed that the rename panic errors like this one:

PANIC: rename from /data/pg_xlog/000000030000001F to
/data/pg_xlog/000000030000002C (initialization of log file 3, segment 44)
failed: No such file or directory

come shortly AFTER the following messages:

LOG: recycled transaction log file 000000030000001B
LOG: recycled transaction log file 000000030000001C
LOG: recycled transaction log file 000000030000001D
LOG: recycled transaction log file 000000030000001E
LOG: removing transaction log file 000000030000001F
LOG: removing transaction log file 0000000300000020
LOG: removing transaction log file 0000000300000021
LOG: removing transaction log file 0000000300000022

So you can see that the file 000000030000001F had already been deleted by
the logic in the MoveOfflineLogs() function.
As far as I can see, MoveOfflineLogs() does not seem to be synchronized
between backends.
The loop that reads the xlog directory is not synchronized itself, and
neither is the calling code:

CreateCheckPoint(bool shutdown, bool force)
...
    LWLockRelease(ControlFileLock);

    /*
     * We are now done with critical updates; no need for system panic if
     * we have trouble while fooling with offline log segments.
     */
    END_CRIT_SECTION();

    /*
     * Delete offline log files (those no longer needed even for previous
     * checkpoint).
     */
    if (_logId || _logSeg)
    {
        PrevLogSeg(_logId, _logSeg);
        MoveOfflineLogs(_logId, _logSeg, recptr);
    }
...

So is it possible that, due to the lack of synchronization, two backends call
MoveOfflineLogs() simultaneously?
For example, the first backend unlinks a log segment file, and then the
second one tries to rename the same file because readdir() returned its name
before the first backend deleted it.
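
To make the suspected sequence concrete, here is a minimal single-process
sketch. It is not PostgreSQL code: the function name simulate_stale_rename()
and the /tmp scratch directory are my own assumptions, and the two "backends"
are simply simulated one after the other. It only shows that a name obtained
from readdir() can go stale and that a later rename() on it then fails with
"No such file or directory" (ENOENT), as in the PANIC above.

```c
#define _XOPEN_SOURCE 700   /* for mkdtemp() under strict -std= modes */
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical demo, not PostgreSQL code.
 * Returns the errno seen by "backend B" when it renames a segment that
 * "backend A" already removed; 0 if the rename unexpectedly succeeds;
 * -1 if the scratch setup fails.
 */
int simulate_stale_rename(void)
{
    char dir[] = "/tmp/xlogdemoXXXXXX";     /* assumed scratch location */
    char src[512], dst[512], seen[256] = "";
    struct dirent *de;
    DIR *d;
    FILE *f;
    int result;

    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(src, sizeof(src), "%s/000000030000001F", dir);
    snprintf(dst, sizeof(dst), "%s/000000030000002C", dir);
    if ((f = fopen(src, "w")) == NULL) {
        rmdir(dir);
        return -1;
    }
    fclose(f);

    /* "Backend B" scans the directory and remembers the segment name. */
    if ((d = opendir(dir)) == NULL) {
        unlink(src);
        rmdir(dir);
        return -1;
    }
    while ((de = readdir(d)) != NULL) {
        if (strcmp(de->d_name, ".") != 0 && strcmp(de->d_name, "..") != 0)
            snprintf(seen, sizeof(seen), "%s", de->d_name);
    }
    closedir(d);
    if (seen[0] == '\0') {
        unlink(src);
        rmdir(dir);
        return -1;
    }

    /* "Backend A" removes the segment before B gets around to using it. */
    unlink(src);

    /* "Backend B" now acts on the stale name it read earlier. */
    snprintf(src, sizeof(src), "%s/%s", dir, seen);
    errno = 0;
    result = (rename(src, dst) == 0) ? 0 : errno;

    /* Clean up the scratch directory. */
    unlink(dst);
    rmdir(dir);
    return result;
}
```

On a POSIX system, calling simulate_stale_rename() from main() should return
ENOENT. Whether the real backends can actually interleave like this is exactly
the open question.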

However, this scenario seems hard to materialize, since it would have to
happen within a very short time frame.
The "remove" and "rename" log messages look separated in time.
Also, we suspect that the problem happens even with only one client
connected to postgres.

Thanks
