Re: unable to fail over to warm standby server

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Mason Hale <mason(at)onespot(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: unable to fail over to warm standby server
Date: 2010-01-28 09:49:22
Message-ID: 4B615DA2.3040306@enterprisedb.com
Lists: pgsql-bugs

Mason Hale wrote:
> ERROR: could not remove "/tmp/pgsql.trigger.5432": Operation not
> permittedtrigger file found
>
> ERROR: could not remove "/tmp/pgsql.trigger.5432": Operation not permitted
>
> This file was not looked at until after the attempt to recover was
> aborted. Clearly the permissions on /tmp/pgsql.trigger.5432 were a
> problem, but we don't see how that would explain the error messages,
> which seem to indicate that data on the standby server was corrupted.

Yes, that permission problem seems to be the root cause of the troubles.
If pg_standby fails to remove the trigger file, it exit()s with whatever
return code the unlink() call returned:

>     /*
>      * If trigger file found, we *must* delete it. Here's why: When
>      * recovery completes, we will be asked again for the same file from
>      * the archive using pg_standby so must remove trigger file so we can
>      * reload file again and come up correctly.
>      */
>     rc = unlink(triggerPath);
>     if (rc != 0)
>     {
>         fprintf(stderr, "\n ERROR: could not remove \"%s\": %s", triggerPath, strerror(errno));
>         fflush(stderr);
>         exit(rc);
>     }

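unlink() returns -1 on error, so pg_standby calls exit(-1). -1 is out of
the range of normal return codes: only the low 8 bits of an exit status
reach the parent, so -1 becomes 255, and the raw wait status for exit
code 255 is 255 << 8 = 65280. That is apparently the mysterious 65280
code you saw in the logs. The server treats that as a fatal error, and
dies.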
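
The mangling can be reproduced outside of pg_standby. The following is a
minimal sketch, assuming a POSIX system; raw_status_after_exit() is a
hypothetical helper written for this illustration and is not part of
pg_standby or the server. It forks a child that exits with the given
code and returns the raw wait status the parent sees. The child uses
_exit() rather than exit() only to avoid flushing the parent's stdio
buffers twice; the exit status is truncated the same way.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child that exits with "code" and return the
 * raw wait status reported to the parent, or -1 if fork/waitpid fails.
 */
int
raw_status_after_exit(int code)
{
    pid_t pid = fork();
    int   status;

    if (pid < 0)
        return -1;              /* fork failed */

    if (pid == 0)
        _exit(code);            /* child: only the low 8 bits survive */

    if (waitpid(pid, &status, 0) < 0)
        return -1;              /* waitpid failed */

    return status;              /* raw status, not yet decoded */
}
```

On Linux, raw_status_after_exit(-1) returns 65280, and
WEXITSTATUS() applied to that value gives 255. The exact encoding of the
raw status is platform-defined, so portable code should decode it with
WIFEXITED() and WEXITSTATUS() rather than compare the raw number.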

That seems like a bug in pg_standby, but I'm not sure what it should do
if the unlink() fails. It could exit with some other exit code, so that
the server wouldn't die, but the lingering trigger file could cause
problems, as the comment explains. If it should indeed cause FATAL, it
should do so in a more robust way than the exit(rc) call above.
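
As one illustration of "more robust", the unlink() result could be mapped
to an explicit exit code inside the valid 0..255 range instead of being
passed through. This is only a sketch: remove_trigger_file() and
TRIGGER_REMOVE_FAILED are hypothetical names, and the value 200 is a
placeholder. Which code the server should treat as fatal is a separate
question that this sketch does not answer.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical placeholder: any fixed, documented value in 1..255. */
#define TRIGGER_REMOVE_FAILED 200

/*
 * Hypothetical helper: try to remove the trigger file. Returns 0 on
 * success, or TRIGGER_REMOVE_FAILED after reporting the reason on stderr.
 * The return value is always a well-defined exit code, never unlink()'s -1.
 */
int
remove_trigger_file(const char *trigger_path)
{
    if (unlink(trigger_path) != 0)
    {
        fprintf(stderr, "ERROR: could not remove \"%s\": %s\n",
                trigger_path, strerror(errno));
        fflush(stderr);
        return TRIGGER_REMOVE_FAILED;
    }
    return 0;
}
```

A caller would then exit with the returned value when it is nonzero, so
the parent always sees a deliberate code rather than a truncated -1.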

BTW, this changed in PostgreSQL 8.4; pg_standby no longer tries to
delete the trigger file (so that problematic block of code is gone), but
there's a new recovery_end_command option in recovery.conf instead, where
you're supposed to put 'rm <triggerfile>'. I think in that
configuration, the standby would've started up, even though removal of
the trigger file would've still failed.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
