Re: BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving.

From: Luke Koops <luke(dot)koops(at)entrust(dot)com>
To: 'Heikki Linnakangas' <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving.
Date: 2009-09-12 03:37:37
Message-ID: A3144629B5AC714A8BF27806EBFA7057514623F2@sottexch7.corp.ad.entrust.com
Lists: pgsql-bugs

I picked up the patch and verified both fixes on 8.3.7.

In one test, handles to two different WAL files were being held by two different backends. The WAL files were renamed to .deleted after I forced an xlog switch. Eventually the .deleted files disappeared. In one case the backend exited. In the other, the backend moved on to the latest WAL file.

In another test, I opened a WAL file so that it could not be renamed or deleted. The appropriate error was logged and the .done file remained. The error is logged quite frequently. When I released the WAL file, it was soon deleted.

If you get into a case where the rename works but the unlink fails (I don't see how this could happen in real life, except possibly for a race condition with AV software), you will have a situation where there is a .done file that does not match any WAL file, and you will have a .deleted file that won't get cleaned up.

I couldn't reproduce this, so I faked it by adding a .done file back into the archive_status folder after it was deleted. The orphaned .done file doesn't cause any trouble. It doesn't get cleaned up, it doesn't generate any log messages, and it doesn't interfere with WAL file recycling or removal (unlike the trouble that is caused by orphaned .ready files).
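For anyone following along, the ordering I was testing can be sketched roughly as below. This is only an illustration written with plain POSIX calls, not the committed patch: the function name remove_segment_then_status and the fixed-size path buffer are my own assumptions, and the real code does the rename step only on Windows. The point it shows is that the .done file is removed only after both rename() and unlink() have succeeded, so a failure at either step leaves the status file in place.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not PostgreSQL source): remove a WAL segment by
 * renaming it to "<name>.deleted" and then unlinking it, and remove the
 * matching .done status file only if both steps succeeded.
 *
 * Returns 0 on success, -1 on any failure. On failure at the rename or
 * unlink step the .done file is left untouched.
 */
int
remove_segment_then_status(const char *wal_path, const char *done_path)
{
    char tmp_path[1024];
    int  n;

    n = snprintf(tmp_path, sizeof(tmp_path), "%s.deleted", wal_path);
    if (n < 0 || (size_t) n >= sizeof(tmp_path))
    {
        errno = ENAMETOOLONG;
        return -1;
    }

    /* Step 1: move the segment out of the way. */
    if (rename(wal_path, tmp_path) != 0)
    {
        fprintf(stderr, "could not rename \"%s\" to \"%s\": %s\n",
                wal_path, tmp_path, strerror(errno));
        return -1;              /* .done file is kept */
    }

    /* Step 2: delete the renamed file. */
    if (unlink(tmp_path) != 0)
    {
        fprintf(stderr, "could not remove \"%s\": %s\n",
                tmp_path, strerror(errno));
        return -1;              /* .done kept, .deleted file left behind */
    }

    /* Step 3: only now is it safe to drop the status file. */
    if (unlink(done_path) != 0 && errno != ENOENT)
    {
        fprintf(stderr, "could not remove \"%s\": %s\n",
                done_path, strerror(errno));
        return -1;
    }

    return 0;
}
```

The second early return is the case described above: the rename worked, the unlink did not, and you are left with a .deleted file next to a .done file that no longer matches any segment name.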

The patch looks good.

Thank you,

-Luke

> -----Original Message-----
> From: Heikki Linnakangas [mailto:heikki(dot)linnakangas(at)enterprisedb(dot)com]
> Sent: Thursday, September 10, 2009 5:44 AM
> Cc: Tom Lane; Luke Koops; pgsql-bugs(at)postgresql(dot)org
> Subject: Re: [BUGS] BUG #5038: WAL file is pending deletion
> in pg_xlog folder, this interferes with WAL archiving.
>
> Heikki Linnakangas wrote:
> > Tom Lane wrote:
> >> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> >>> No, it's a backend that's holding the file open, with FILE_SHARE_DELETE.
> >> If that's the only case we care about covering, then rename might be
> >> enough. I was just wondering what it would take to solve the more
> >> general problem of something holding it open with the wrong flags at
> >> the time we want to get rid of it.
> >
> > Yes, that's a separate problem, and I think we should address that too.
> > That's what I thought was going on in OP's case at first, the patch I
> > posted in my first reply should address that.
> >
> > I'll try to reproduce that case too, and verify that the patch fixes it.
>
> Ok, I've committed a patch along those lines. The file is now
> renamed before unlinking (on Windows), and the return code of
> rename() and
> unlink() is checked, so that we don't delete the .done file
> if the WAL file deletion failed. This fixes both scenarios,
> the one OP reported with another backend keeping the file
> open, and the one where a different process keeps a file open
> without FILE_SHARE_DELETE.
>
> I considered making failure to rename or delete a WARNING
> instead of ERROR, so that RemoveOldXLogFiles() would still
> clean up any other old WAL files. However, when a file is
> recycled, we throw an error anyway if the rename fails in
> InstallXLogFileSegment(), so it doesn't seem like it would
> buy us much.
>
> BTW, it seems that errno is not set on Windows when rename
> fails, but we still try to print the OS error message in
> InstallXLogFileSegment().
> When I tested the case where another process is keeping the
> file locked, for example, I got this:
>
> ERROR: could not rename file
> "pg_xlog/000000010000000100000073" to
> "pg_xlog/000000010000000100000092" (initialization of log
> file 1, segment 146): No such file or directory
>
> even though the file clearly exists, it's just locked. I'm
> not sure where errno is coming from in that case, and if we
> should do something about that, but that exceeds my appetite
> for fixing Windows issues right now.
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
>
