From: | Simon Riggs <simon(at)2ndQuadrant(dot)com> |
---|---|
To: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: WAL Restore process during recovery |
Date: | 2012-01-23 15:23:22 |
Message-ID: | CA+U5nM+fz+OnWbLePcPn-3LP1HjL19Kx2nO8u+k-Uyiy=H8f4w@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Mon, Jan 23, 2012 at 12:23 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> In StartChildProcess(), the code which emits an error when fork of walrestore
> fails is required.
OK
> In reaper(), the following comment needs to be updated because an unexpected
> exit (including FATAL) is treated as a crash in the patch.
>
> /*
> * Was it the wal restore? If exit status is zero (normal) or one
> * (FATAL exit), we assume everything is all right just like normal
> * backends.
> */
> if (pid == WalRestorePID)
OK
> Why does walrestore need to be invoked even when restore_command is
> not specified? It seems to be useless. We invoke walreceiver only when
> primary_conninfo is specified now. Similarly we should invoke walrestore
> only when restore_command is specified?
walreceiver is shutdown and restarted in case of failed connection.
That never happens with walrestore because the command is run each
time - when we issue system(3) a new process is forked to run the
command. So there is no specific cleanup to perform and so no reason
for a managed cleanup process.
So I can't see a specific reason to change that. Do you think it makes
a difference?
> When I set up the file-based log-shipping environment using pg_standby,
> ran "pgbench -i -s2", waited for walrestore to restore at least one WAL
> file, and created the trigger file, then I encounterd the following error in
> the standby.
>
> sby LOG: startup process requests 000000010000000000000003 from archive
> trigger file found: smart failover
> sby LOG: startup process sees last file was 000000010000000000000003
> sby FATAL: could not rename file "pg_xlog/RECOVERYXLOG" to
> "pg_xlog/000000010000000000000003": No such file or directory
> sby LOG: startup process (PID 11079) exited with exit code 1
> sby LOG: terminating any other active server processes
Will look further.
> When I set up streaming replication with setting restore_command,
> I got the following messages repeatedly. The WAL file name was always
> "000000000000000000000000".
Will look further.
> sby1 LOG: walrestore checking for next file to restore
> sby1 LOG: restore of 000000000000000000000000 is already complete, so sleep
Will look further.
> In PostmasterStateMachine(), the following comment needs to mention WALRestore.
>
> * PM_WAIT_READONLY state ends when we have no regular backends that
> * have been started during recovery. We kill the startup and
> * walreceiver processes and transition to PM_WAIT_BACKENDS. Ideally,
OK
> In walrestore.c, the following comments seem to be incorrect. At least
> an unexpected
> exit of WALRestore doesn't start a recovery cycle in the patch.
>
> * If the WAL restore exits unexpectedly, the postmaster treats
> that the same
> * as a backend crash: shared memory may be corrupted, so remaining backends
> * should be killed by SIGQUIT and then a recovery cycle started.
Yes it does...
> In walrestore.c
> + * Main entry point for walrestore process
> + *
> + * This is invoked from BootstrapMain, which has already created the basic
> + * execution environment, but not enabled signals yet.
>
> BootstrapMain() doesn't exist, and it should be changed to
> AuxiliaryProcessMain().
> This is not a fault of the patch. There are the same typos in
> bgwriter.c, walwriter.c
> and checkpointer.c
OK, will fix.
> In walrestore.c
> + * SIGUSR1 is presently unused; keep it spare in case someday we want this
> + * process to participate in ProcSignal signalling.
>
> The above comment is incorrect because SIGUSR1 is presently used.
OK
> + /*
> + * From here on, elog(ERROR) should end with exit(1), not send
> + * control back to the sigsetjmp block above
> + */
> + ExitOnAnyError = true;
>
> The above is not required because sigsetjmp is not used in walrestore.c
OK
> + /* Normal exit from the walwriter is here */
> + proc_exit(0); /* done */
>
> Typo: s/walwriter/walrestore
OK
Cleaned up the points noted, new patch attached in case you wish to
review further.
Still has bug, so still with me to fix.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment | Content-Type | Size |
---|---|---|
walrestore_process.v3.patch | text/x-patch | 50.8 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Dimitri Fontaine | 2012-01-23 15:26:22 | Re: Inline Extension |
Previous Message | Robert Haas | 2012-01-23 15:16:20 | Re: Inline Extension |