BUG #13459: Replaying WAL logs can hang on startup

From: chris+postgresql(at)qwirx(dot)com
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #13459: Replaying WAL logs can hang on startup
Date: 2015-06-22 08:42:01
Message-ID: 20150622084201.3874.74819@wrigleys.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 13459
Logged by: Chris Wilson
Email address: chris+postgresql(at)qwirx(dot)com
PostgreSQL version: 9.4.1
Operating system: Linux 2.6.32-220.30.1.el6.x86_64
Description:

Dear developers,

I restored a standby from a pg_basebackup made from the master the previous
morning. As a result, it had a lot of WAL logs to catch up on.

At one point it hangs while restoring logs. It normally takes a few seconds
to process a 16 MB WAL segment, but on this one, it was "recovering
0000000100000CEC00000025" for 7 minutes now with no log output at all.

I restarted it at least once and it reached the same point and hung again. I
attached gdb and saw that it was hung here:

(gdb) bt
#0 0x0000003f358e1433 in __select_nocancel () from /lib64/libc.so.6
#1 0x00000000008be748 in pg_usleep ()
#2 0x000000000074ef9f in ForgetRelationFsyncRequests ()
#3 0x000000000074d0b0 in mdunlink ()
#4 0x000000000074faa5 in smgrdounlink ()
#5 0x00000000004d5b3a in xact_redo_commit_internal ()
#6 0x00000000004d5c36 in xact_redo_commit ()
#7 0x00000000004d5f37 in xact_redo ()
#8 0x00000000004eb36a in StartupXLOG ()
#9 0x00000000006eac29 in StartupProcessMain ()
#10 0x00000000004f9eeb in AuxiliaryProcessMain ()
#11 0x00000000006e9ea1 in StartChildProcess ()
#12 0x00000000006e50ab in PostmasterMain ()
#13 0x000000000064e693 in main ()

Browsing the code, I saw this interesting comment in
ForgetRelationFsyncRequests:

void
ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
{
if (pendingOpsTable)
{
/* standalone backend or startup process: fsync state is
local */
RememberFsyncRequest(rnode, forknum,
FORGET_RELATION_FSYNC);
}
else if (IsUnderPostmaster)
{
/*
* Notify the checkpointer about it. If we fail to queue
the cancel
* message, we have to sleep and try again ... ugly, but
hopefully
* won't happen often.
*
* XXX should we CHECK_FOR_INTERRUPTS in this loop?
Escaping with an
* error would leave the no-longer-used file still present
on disk,
* which would be bad, so I'm inclined to assume that the
checkpointer
* will always empty the queue soon.
*/
while (!ForwardFsyncRequest(rnode, forknum,
FORGET_RELATION_FSYNC))
pg_usleep(10000L); /* 10 msec seems a good
number */

/*
* Note we don't wait for the checkpointer to actually
absorb the
* cancel message; see mdsync() for the implications.
*/
}
}

My guess is that ForwardFsyncRequest() is continually returning false, and
this code is stuck forever. I noticed that it says that "I'm inclined to
assume that the checkpointer
will always empty the queue soon", but there is no checkpointer running
during recovery, is there?

Although I managed to recover the system by restoring from a different
backup the next day, bypassing this logfile, I think I still have the copy
that hangs on this logfile so I should be able to conduct further
investigations if required. Please let me know.

Cheers, Chris.

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message 1809563951 2015-06-22 13:15:52 BUG #13460: ERROR: could not find block containing chunk 0x2930072
Previous Message gregburek 2015-06-22 04:33:37 BUG #13458: postgres_fdw with usermapping dumped with pg_dump --no-owners results in dump file with implicit own