Quick Links

Re: WAL Restore process during recovery

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WAL Restore process during recovery
Date:	2012-01-23 15:23:22
Message-ID:	CA+U5nM+fz+OnWbLePcPn-3LP1HjL19Kx2nO8u+k-Uyiy=H8f4w@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Mon, Jan 23, 2012 at 12:23 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:

> In StartChildProcess(), the code which emits an error when fork of walrestore
> fails is required.

> In reaper(), the following comment needs to be updated because an unexpected
> exit (including FATAL) is treated as a crash in the patch.
>
> /*
> * Was it the wal restore? If exit status is zero (normal) or one
> * (FATAL exit), we assume everything is all right just like normal
> * backends.
> */
> if (pid == WalRestorePID)

> Why does walrestore need to be invoked even when restore_command is
> not specified? It seems to be useless. We invoke walreceiver only when
> primary_conninfo is specified now. Similarly we should invoke walrestore
> only when restore_command is specified?

walreceiver is shutdown and restarted in case of failed connection.
That never happens with walrestore because the command is run each
time - when we issue system(3) a new process is forked to run the
command. So there is no specific cleanup to perform and so no reason
for a managed cleanup process.

So I can't see a specific reason to change that. Do you think it makes
a difference?

> When I set up the file-based log-shipping environment using pg_standby,
> ran "pgbench -i -s2", waited for walrestore to restore at least one WAL
> file, and created the trigger file, then I encounterd the following error in
> the standby.
>
> sby LOG: startup process requests 000000010000000000000003 from archive
> trigger file found: smart failover
> sby LOG: startup process sees last file was 000000010000000000000003
> sby FATAL: could not rename file "pg_xlog/RECOVERYXLOG" to
> "pg_xlog/000000010000000000000003": No such file or directory
> sby LOG: startup process (PID 11079) exited with exit code 1
> sby LOG: terminating any other active server processes

Will look further.

> When I set up streaming replication with setting restore_command,
> I got the following messages repeatedly. The WAL file name was always
> "000000000000000000000000".

Will look further.

> sby1 LOG: walrestore checking for next file to restore
> sby1 LOG: restore of 000000000000000000000000 is already complete, so sleep

Will look further.

> In PostmasterStateMachine(), the following comment needs to mention WALRestore.
>
> * PM_WAIT_READONLY state ends when we have no regular backends that
> * have been started during recovery. We kill the startup and
> * walreceiver processes and transition to PM_WAIT_BACKENDS. Ideally,

> In walrestore.c, the following comments seem to be incorrect. At least
> an unexpected
> exit of WALRestore doesn't start a recovery cycle in the patch.
>
> * If the WAL restore exits unexpectedly, the postmaster treats
> that the same
> * as a backend crash: shared memory may be corrupted, so remaining backends
> * should be killed by SIGQUIT and then a recovery cycle started.

Yes it does...

> In walrestore.c
> + * Main entry point for walrestore process
> + *
> + * This is invoked from BootstrapMain, which has already created the basic
> + * execution environment, but not enabled signals yet.
>
> BootstrapMain() doesn't exist, and it should be changed to
> AuxiliaryProcessMain().
> This is not a fault of the patch. There are the same typos in
> bgwriter.c, walwriter.c
> and checkpointer.c

OK, will fix.

> In walrestore.c
> + * SIGUSR1 is presently unused; keep it spare in case someday we want this
> + * process to participate in ProcSignal signalling.
>
> The above comment is incorrect because SIGUSR1 is presently used.

> + /*
> + * From here on, elog(ERROR) should end with exit(1), not send
> + * control back to the sigsetjmp block above
> + */
> + ExitOnAnyError = true;
>
> The above is not required because sigsetjmp is not used in walrestore.c

> + /* Normal exit from the walwriter is here */
> + proc_exit(0); /* done */
>
> Typo: s/walwriter/walrestore

Cleaned up the points noted, new patch attached in case you wish to
review further.

Still has bug, so still with me to fix.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment	Content-Type	Size
walrestore_process.v3.patch	text/x-patch	50.8 KB

In response to

Re: WAL Restore process during recovery at 2012-01-23 12:23:52 from Fujii Masao

Responses

Re: WAL Restore process during recovery at 2012-01-24 09:43:47 from Fujii Masao

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Dimitri Fontaine	2012-01-23 15:26:22	Re: Inline Extension
Previous Message	Robert Haas	2012-01-23 15:16:20	Re: Inline Extension