Re: pg_ctl promote wait

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: pg_ctl promote wait
Date: 2016-03-09 20:08:24
Message-ID: CAB7nPqTS5J3-G_zTow0Kc5oqZn877RDDN1Mfcqm2PscAS7FnAw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Feb 29, 2016 at 4:29 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Mon, Feb 29, 2016 at 4:28 PM, Michael Paquier
> <michael(dot)paquier(at)gmail(dot)com> wrote:
>> I would suggest using
>> $node_standby->poll_query_until('SELECT pg_is_in_recovery()') to
>> validate the end of the test.
>
> Meh. SELECT NOT pg_is_in_recovery(). This will wait until the query
> returns true.

Here are some comments about 0002

+ if ((fd = open(control_file_path, O_RDONLY | PG_BINARY, 0)) == -1)
+ {
+     fprintf(stderr, _("%s: could not open file \"%s\" for reading: %s\n"),
+             progname, control_file_path, strerror(errno));
+     exit(1);
+ }
[...]
Most of the logic of get_control_dbstate() mimics the
recently-introduced get_controlfile() in controldata_utils.c from
dc7d70e. I think that we had better reuse that, and that we had better
emit an error if an incorrect control file is found while running
those pg_ctl commands: the control file on disk should have a correct
CRC at all times, or something wrong is going on. So this would lead
to the following logic for get_control_dbstate():
control_file_data = get_controlfile(pg_data, progname);
res = control_file_data->state;
pfree(control_file_data);
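
Wrapped up as a function, that gives roughly the following (untested
sketch, assuming the usual pg_ctl.c globals pg_data and progname plus
includes of common/controldata_utils.h and catalog/pg_control.h for
DBState):

/*
 * Sketch only: reuse get_controlfile() from controldata_utils.c
 * instead of duplicating the control-file-reading logic in pg_ctl.c.
 */
static DBState
get_control_dbstate(void)
{
    DBState     ret;
    ControlFileData *control_file_data = get_controlfile(pg_data, progname);

    ret = control_file_data->state;
    pfree(control_file_data);
    return ret;
}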

That aside, 0002 is a good thing to have: switching from checking the
presence of recovery.conf to looking at the control file is definitely
more robust. A lot happens between the moment recovery.conf is renamed
to recovery.done and the moment WAL writes are enabled for backends,
particularly the end-of-recovery checkpoint and the cleanup of the WAL
segments of the previous timeline.

And now for 0003...

+$node_standby->command_like(['psql', '-X', '-A', '-t', '-c', 'SELECT pg_is_in_recovery()'],
+    qr/^t$/,
+    'standby is in recovery');
[...]
+$node_standby->command_like(['psql', '-X', '-A', '-t', '-c', 'SELECT pg_is_in_recovery()'],
+    qr/^f$/,
+    'promoted standby is not in recovery');
For those two you could use $node_standby->psql_safe to get back the
query result directly.

> The subsequent discussion mentioned that there might still be a window
> between end of waiting and when read-write queries would be accepted. I
> don't know how big that window would be in practice and would be
> interested in some testing and feedback.

And so... Based on the previous discussion, there is an interval of
time between the moment the control file is updated and the point
where backends are allowed to write WAL. This interval really worries
me: once pg_ctl exits, client applications should be guaranteed to be
able to connect to the server, but the current patch is not
failure-proof on that front, and I imagine that on CPU-constrained
environments in particular it is going to be unstable. I expect that
slow machines are likely to fail the last test of 003_promote.pl as
designed (I am away from home now, so unfortunately I have not been
able to check that on my own machines, but it looks possible):
pg_is_in_recovery() is controlled by SharedRecoveryInProgress, so it
may still return true while the control file status is already
DB_IN_PRODUCTION. The main factor that can widen this window is a
large number of 2PC transactions that need to be loaded back into
shared memory after scanning pg_twophase.
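
For reference, pg_is_in_recovery() just calls RecoveryInProgress(),
which looks only at the shared flag and caches the result in each
backend, roughly like this (simplified from memory, not the exact
xlog.c code):

bool
RecoveryInProgress(void)
{
    /* Once a backend has seen recovery as finished, it never rechecks. */
    if (!LocalRecoveryInProgress)
        return false;
    else
    {
        /* use a volatile pointer to force a fresh read of the shared flag */
        volatile XLogCtlData *xlogctl = XLogCtl;

        LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
        return LocalRecoveryInProgress;
    }
}

So what SQL reports and what pg_control says on disk can disagree for
as long as the shared flag has not been cleared.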

If we are going to have a reliable promote wait mode for pg_ctl, I
think that we had better first reduce this window. One thing that
could be done is to update SharedRecoveryInProgress while holding an
exclusive lock on ControlFileLock; see the patch attached, and the
rough sketch of the flow below. For backends that look at ControlFile
in shared memory, ControlFileLock reduces this window to zero. For
clients, there will still be a small window during which backends
could write WAL while the control file status on disk is still
DB_IN_ARCHIVE_RECOVERY. So I think we had better do something like the
attached first. Thoughts?
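
To make the flow I have in mind for StartupXLOG() concrete, it would
look roughly like this (simplified sketch of the idea, the real change
being the attached patch):

    /*
     * Switch the control file to DB_IN_PRODUCTION and clear
     * SharedRecoveryInProgress inside the same ControlFileLock section,
     * so that anything reading ControlFile in shared memory under that
     * lock sees a state consistent with what RecoveryInProgress()
     * reports.
     */
    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
    ControlFile->state = DB_IN_PRODUCTION;
    ControlFile->time = (pg_time_t) time(NULL);

    SpinLockAcquire(&XLogCtl->info_lck);
    XLogCtl->SharedRecoveryInProgress = false;
    SpinLockRelease(&XLogCtl->info_lck);

    UpdateControlFile();
    LWLockRelease(ControlFileLock);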

Looking at where the shared memory structure of ControlFile is used,
the one thing to worry about is CreateRestartPoint(), but its code
path already takes ControlFileLock when looking at the control file
status, so it is safe with this logic.
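
For reference, that check in CreateRestartPoint() already has roughly
this shape (from memory):

    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
    if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
        ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
    {
        /* update the checkpoint fields, then write the file out */
        UpdateControlFile();
    }
    LWLockRelease(ControlFileLock);

so a restart point racing with the end of recovery sees either the old
state or the new one, never a mix.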
--
Michael

Attachment Content-Type Size
delay-status-update-end-recovery.patch text/x-patch 2.1 KB
