Re: unable to fail over to warm standby server

From: Mason Hale <mason(at)onespot(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: unable to fail over to warm standby server
Date: 2010-01-28 15:03:46
Message-ID: 1e85dd391001280703l4c13e231m77e50e2630f34975@mail.gmail.com
Lists: pgsql-bugs

Hello Heikki --

Thank you for investigating this issue and clearing up this mystery.
I do not believe it is obvious that the postgres process needs to be able to
remove the trigger file.

My naive assumption was that the trigger file was merely a flag to signal
that recovery mode needed to be stopped. If I were to guess what those steps
would be, I would assume the following:

- detect the presence of the trigger file
- stop the postgres process safely (e.g. pg_ctl ... stop)
- rename recovery.conf to recovery.done
- restart the postgres process (e.g. pg_ctl ... start)

It is not obvious that the trigger file needs to be removed.
And if permissions prevent it from being removed, the last thing that should
happen is for the database to become corrupted.

At minimum the pg_standby documentation should make this requirement clear.
I suggest language to the effect of the following:

> Note it is critical the trigger file be created with permissions that allow
> the postgres process to remove the file. Generally this is best done by
> creating the file from the postgres user account. Data corruption may result
> if the trigger file permissions prevent deletion of the trigger file.

Of course the best solution is to avoid this issue entirely. Something as
easy to miss as file permissions should not cause data corruption,
especially in the process meant to fail over from a crashing primary
database.

thanks,

Mason Hale
http://www.onespot.com
direct +1 800.618.0768 ext 701

On Thu, Jan 28, 2010 at 3:49 AM, Heikki Linnakangas <
heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

> Mason Hale wrote:
> > ERROR: could not remove "/tmp/pgsql.trigger.5432": Operation not
> > permittedtrigger file found
> >
> > ERROR: could not remove "/tmp/pgsql.trigger.5432": Operation not
> > permitted
> >
> > This file was not looked at until after the attempt to recover was
> > aborted. Clearly the permissions on /tmp/pgsql.trigger.5432 were a
> > problem,
> > but we don't see how that would explain the error messages, which seem
> > to indicate that data on the standby server was corrupted.
>
> Yes, that permission problem seems to be the root cause of the troubles.
> If pg_standby fails to remove the trigger file, it exit()s with whatever
> return code the unlink() call returned:
>
> > /*
> >  * If trigger file found, we *must* delete it. Here's why: When
> >  * recovery completes, we will be asked again for the same file from
> >  * the archive using pg_standby so must remove trigger file so we can
> >  * reload file again and come up correctly.
> >  */
> > rc = unlink(triggerPath);
> > if (rc != 0)
> > {
> >     fprintf(stderr, "\n ERROR: could not remove \"%s\": %s", triggerPath, strerror(errno));
> >     fflush(stderr);
> >     exit(rc);
> > }
>
> unlink() returns -1 on error, so pg_standby calls exit(-1). -1 is out of
> the range of normal return codes, and apparently gets mangled into the
> mysterious 65280 code you saw in the logs. The server treats that as a
> fatal error, and dies.
>
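
The arithmetic behind the 65280 figure can be reproduced with a small sketch. It assumes a Linux-like system where the raw wait status keeps the exit code in bits 8..15, and it assumes the server logged that raw status rather than the decoded exit code. The helper name raw_wait_status_for_exit is hypothetical and not part of pg_standby.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that exits with the given code and return the raw wait
 * status seen by the parent, or -1 if fork() or waitpid() fails.
 * Hypothetical helper for illustration only.
 */
static int raw_wait_status_for_exit(int code)
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;          /* fork failed */
    if (pid == 0)
        _exit(code);        /* child: reports the same status as exit(code),
                             * but skips flushing inherited stdio buffers */
    if (waitpid(pid, &status, 0) < 0)
        return -1;          /* waitpid failed */
    return status;          /* raw status, not yet decoded */
}
```

Only the low 8 bits of the exit code reach the parent, so -1 arrives as 255. On the systems assumed here the raw status for that is 255 * 256 = 65280, while WEXITSTATUS() on the same value gives 255.
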
> That seems like a bug in pg_standby, but I'm not sure what it should do
> if the unlink() fails. It could exit with some other exit code, so that
> the server wouldn't die, but the lingering trigger file could cause
> problems, as the comment explains. If it should indeed cause FATAL, it
> should do so in a more robust way than the exit(rc) call above.
>
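
One way to read "a more robust way" is to report the failure with a fixed exit code in the 0..255 range instead of passing the return value of unlink() on. The following is a minimal sketch under that assumption, not the actual pg_standby fix. The function name remove_trigger_file and the value 200 are hypothetical, and how the server should react to such a code is left open here.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical exit code chosen for this sketch; any fixed value in 1..255 works. */
#define TRIGGER_REMOVE_FAILED 200

/*
 * Try to remove the trigger file.
 * Returns 0 on success, or TRIGGER_REMOVE_FAILED if unlink() fails,
 * so the caller never hands unlink()'s -1 to exit().
 */
static int remove_trigger_file(const char *trigger_path)
{
    if (unlink(trigger_path) != 0)
    {
        fprintf(stderr, "\nERROR: could not remove \"%s\": %s\n",
                trigger_path, strerror(errno));
        fflush(stderr);
        return TRIGGER_REMOVE_FAILED;
    }
    return 0;
}
```

A caller would then do something like: rc = remove_trigger_file(triggerPath); if (rc != 0) exit(rc); which keeps the exit code well defined whatever unlink() returned.
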
> BTW, this changed in PostgreSQL 8.4; pg_standby no longer tries to
> delete the trigger file (so that problematic block of code is gone), but
> there's a new recovery_end_command option in recovery.conf instead, where
> you're supposed to put 'rm <triggerfile>'. I think in that
> configuration, the standby would've started up, even though removal of
> the trigger file would've still failed.
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
>
