Re: WAL Restore process during recovery

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL Restore process during recovery
Date: 2012-01-23 15:23:22
Message-ID: CA+U5nM+fz+OnWbLePcPn-3LP1HjL19Kx2nO8u+k-Uyiy=H8f4w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jan 23, 2012 at 12:23 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:

> In StartChildProcess(), the code which emits an error when fork of walrestore
> fails is required.

OK

> In reaper(), the following comment needs to be updated because an unexpected
> exit (including FATAL) is treated as a crash in the patch.
>
>                /*
>                 * Was it the wal restore?  If exit status is zero (normal) or one
>                 * (FATAL exit), we assume everything is all right just like normal
>                 * backends.
>                 */
>                if (pid == WalRestorePID)

OK

> Why does walrestore need to be invoked even when restore_command is
> not specified? It seems to be useless. We invoke walreceiver only when
> primary_conninfo is specified now. Similarly we should invoke walrestore
> only when restore_command is specified?

walreceiver is shutdown and restarted in case of failed connection.
That never happens with walrestore because the command is run each
time - when we issue system(3) a new process is forked to run the
command. So there is no specific cleanup to perform and so no reason
for a managed cleanup process.

So I can't see a specific reason to change that. Do you think it makes
a difference?

> When I set up the file-based log-shipping environment using pg_standby,
> ran "pgbench -i -s2", waited for walrestore to restore at least one WAL
> file, and created the trigger file, then I encounterd the following error in
> the standby.
>
>    sby LOG:  startup process requests 000000010000000000000003 from archive
>    trigger file found: smart failover
>    sby LOG:  startup process sees last file was 000000010000000000000003
>    sby FATAL:  could not rename file "pg_xlog/RECOVERYXLOG" to
> "pg_xlog/000000010000000000000003": No such file or directory
>    sby LOG:  startup process (PID 11079) exited with exit code 1
>    sby LOG:  terminating any other active server processes

Will look further.

> When I set up streaming replication with setting restore_command,
> I got the following messages repeatedly. The WAL file name was always
> "000000000000000000000000".

Will look further.

>    sby1 LOG:  walrestore checking for next file to restore
>    sby1 LOG:  restore of 000000000000000000000000 is already complete, so sleep

Will look further.

> In PostmasterStateMachine(), the following comment needs to mention WALRestore.
>
>                 * PM_WAIT_READONLY state ends when we have no regular backends that
>                 * have been started during recovery.  We kill the startup and
>                 * walreceiver processes and transition to PM_WAIT_BACKENDS.  Ideally,

OK

> In walrestore.c, the following comments seem to be incorrect. At least
> an unexpected
> exit of WALRestore doesn't start a recovery cycle in the patch.
>
>     * If the WAL restore exits unexpectedly, the postmaster treats
> that the same
>     * as a backend crash: shared memory may be corrupted, so remaining backends
>     * should be killed by SIGQUIT and then a recovery cycle started.

Yes it does...

> In walrestore.c
> + * Main entry point for walrestore process
> + *
> + * This is invoked from BootstrapMain, which has already created the basic
> + * execution environment, but not enabled signals yet.
>
> BootstrapMain() doesn't exist, and it should be changed to
> AuxiliaryProcessMain().
> This is not a fault of the patch. There are the same typos in
> bgwriter.c, walwriter.c
> and checkpointer.c

OK, will fix.

> In walrestore.c
> +        * SIGUSR1 is presently unused; keep it spare in case someday we want this
> +        * process to participate in ProcSignal signalling.
>
> The above comment is incorrect because SIGUSR1 is presently used.

OK

> +                       /*
> +                        * From here on, elog(ERROR) should end with exit(1), not send
> +                        * control back to the sigsetjmp block above
> +                        */
> +                       ExitOnAnyError = true;
>
> The above is not required because sigsetjmp is not used in walrestore.c

OK

> +                       /* Normal exit from the walwriter is here */
> +                       proc_exit(0);           /* done */
>
> Typo: s/walwriter/walrestore

OK

Cleaned up the points noted, new patch attached in case you wish to
review further.

Still has bug, so still with me to fix.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
walrestore_process.v3.patch text/x-patch 50.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dimitri Fontaine 2012-01-23 15:26:22 Re: Inline Extension
Previous Message Robert Haas 2012-01-23 15:16:20 Re: Inline Extension