Re: unable to fail over to warm standby server

From: Mason Hale <mason(at)onespot(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-bugs(at)postgresql(dot)org, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: unable to fail over to warm standby server
Date: 2010-01-29 16:27:26
Message-ID: 1e85dd391001290827h5721dc3fn3842f5a163165728@mail.gmail.com
Lists: pgsql-bugs

Hello Fujii --

Thanks for the clarification. It's clear my understanding of the recovery
process is lacking.

My naive assumption was that Postgres would recover using whatever files
were available, and that if it ran out of files it would stop there and
come up. I also assumed that if recovery.conf were renamed it would stop
copying files from the wal_archive into pg_xlog. Thus without the
recovery.conf file, the database would just come up, without expecting or
waiting on additional files. I see my assumption was wrong, but I think you
can agree that it is not surprising someone could expect things to work
this way if they aren't directly familiar with the code.

I think you can also see how the message "If this has occurred more
than once some data might be corrupted and you might need to choose an
earlier recovery target" in the log would lead me to believe my database was
corrupted.

It is good to know that if I had left recovery.conf in place and just
removed the trigger file the issue would have resolved itself.

I'm happy to hear the database was not, in fact, corrupted by this error.

Perhaps it's best to chalk this up to a scenario that creates a confusing,
hard-to-diagnose issue -- one that easily looks like corruption, but
thankfully is not.

Hopefully if anyone tuning into this thread experiences or hears of similar
fail-over problems in the future (say on IRC), they'll remember to check the
permissions on the trigger file.
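
For anyone who wants to script that check, here is a minimal sketch in
Python. It is only an illustration under my own assumptions: that the
script runs as the same OS user as the standby's restore_command, and that
the command needs to both read and remove the trigger file. The function
name describe_trigger_file is hypothetical, and the "removable" test is a
simplification (it ignores sticky-bit directories and ACLs).

```python
import os
import pwd   # Unix-only; used to turn the owner uid into a user name
import stat


def describe_trigger_file(path):
    """Report ownership and access for a trigger file (hypothetical helper).

    Assumption: run this as the same OS user that runs the standby's
    restore_command, so os.access() reflects what that process can do.
    """
    st = os.stat(path)
    parent = os.path.dirname(os.path.abspath(path))

    try:
        owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        # The uid has no passwd entry; fall back to the numeric id.
        owner = str(st.st_uid)

    return {
        "owner": owner,
        "mode": stat.filemode(st.st_mode),
        # Can the current user read the file?
        "readable": os.access(path, os.R_OK),
        # Removing a file needs write + search permission on its directory
        # (simplified: sticky bits and ACLs are not considered).
        "removable": os.access(parent, os.W_OK | os.X_OK),
    }
```

Running describe_trigger_file() on the trigger path as the postgres user
and seeing "readable" or "removable" come back False would point at the
kind of permission problem discussed in this thread.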

Thanks again,
Mason

On Fri, Jan 29, 2010 at 10:02 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:

> On Fri, Jan 29, 2010 at 11:49 PM, Mason Hale <mason(at)onespot(dot)com> wrote:
> > While I did not remove the trigger file, I did rename recovery.conf to
> > recovery.conf.old.
> > That file contained the restore_command configuration that identified the
> > trigger file. So that rename should have eliminated the problem. But it
> > didn't. Even after making this change and taking the trigger file out of the
> > equation my database failed to come online.
>
> Renaming recovery.conf doesn't resolve the problem at all. Instead, the
> sysadmin only had to remove the trigger file, which had the wrong
> permissions, and restart postgres.
>
> >> 9.) The server did not come up (again). This time the contents of the
> >> new postgresql.log file were:
> >>
> >> [postgres(at)prod-db-2 pg_log]$ tail -n 100 postgresql-2010-01-18_211132.log
> >> 2010-01-18 21:11:32 UTC ()LOG: database system was interrupted while in recovery at log time 2010-01-18 20:10:59 UTC
> >> 2010-01-18 21:11:32 UTC ()HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
> >> 2010-01-18 21:11:32 UTC ()LOG: could not open file "pg_xlog/0000000200003C82000000A3" (log file 15490, segment 163): No such file or directory
> >> 2010-01-18 21:11:32 UTC ()LOG: invalid primary checkpoint record
> >> 2010-01-18 21:11:32 UTC ()LOG: could not open file "pg_xlog/0000000200003C8200000049" (log file 15490, segment 73): No such file or directory
> >> 2010-01-18 21:11:32 UTC ()LOG: invalid secondary checkpoint record
> >> 2010-01-18 21:11:32 UTC ()PANIC: could not locate a valid checkpoint record
> >> 2010-01-18 21:11:32 UTC ()LOG: startup process (PID 9328) was terminated by signal 6: Aborted
> >> 2010-01-18 21:11:32 UTC ()LOG: aborting startup due to startup process failure
>
> You seem to focus on the above trouble. I think that this happened because
> recovery.conf was no longer in place, so restore_command was not given. In
> fact, the WAL file (e.g., pg_xlog/0000000200003C82000000A3) required for
> recovery could not be restored from the archive because restore_command
> was not supplied. Then recovery failed.
>
> If the sysadmin had left recovery.conf in place and removed the trigger
> file, pg_standby in restore_command would have restored all WAL files
> required for recovery, and recovery would have proceeded normally.
>
> Hope this helps.
>
> Regards,
>
> --
> Fujii Masao
> NIPPON TELEGRAPH AND TELEPHONE CORPORATION
> NTT Open Source Software Center
>
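
A side note on reading the log quoted above: the "log file" and "segment"
numbers in each message correspond to fields of the WAL file name. The
sketch below assumes the 24-character name is three 8-digit hexadecimal
fields (timeline, log file, segment), which matches the two names and
numbers shown in the log. The function name parse_wal_name is hypothetical.

```python
def parse_wal_name(name):
    """Split a 24-hex-digit WAL file name into (timeline, log, segment).

    Assumption: the name is three 8-digit hexadecimal fields, as the
    file names in the quoted log suggest.
    """
    if len(name) != 24:
        raise ValueError("expected a 24-character WAL file name")
    timeline = int(name[0:8], 16)
    log = int(name[8:16], 16)
    segment = int(name[16:24], 16)
    return timeline, log, segment


# The first missing file from the log: log file 15490, segment 163.
print(parse_wal_name("0000000200003C82000000A3"))  # (2, 15490, 163)
```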
