Dimitri Fontaine wrote:
> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> 1. Initial archive recovery. Standby fetches WAL files from archive
>> using restore_command. When a file is not found in archive, we start
>> walreceiver and switch to state 2
>> 2. Retrying to restore from archive. When the connection to primary is
>> established and replication is started, we switch to state 3
> When does the master know about this new slave being there? I'd say not
> until 3 is ok, and then, the actual details between 1 and 2 look
> strange, partly because it's more about processes than states.
Right. The master doesn't need to know about the slave.
> I'd propose to have 1 and 2 started in parallel from the beginning, and
> as Simon proposes, being able to get back to 1. at any time:
> 0. start from a base backup, determine the first WAL / LSN we need to
> start streaming, call it SR_LSN. That means asking the master its
> current xlog location.
What if the master can't be contacted?
> The LSN we're at now, after replaying the base
> backup and maybe the initial recovery from local WAL files, let's
> call it BASE_LSN.
> 1. Get the missing WAL needed to go from BASE_LSN to SR_LSN from the
> archive, with restore_command, apply the files as we receive them, and
> start step 2, possibly in parallel.
> 2. Streaming replication: we connect to the primary and walreceiver gets
> the WALs from the connection. It either stores them, if the current
> standby's position < SR_LSN, or applies them directly if we were
> already in sync.
> Local storage would be either standby's archiving or a specific
> temporary location. I guess it's more or less what you want to do
> with retrying from the master's archives, but I'm not sure your line
> of though makes it simpler.
> The details about when a slave is in sync will get more important as
> soon as we have synchronous streaming.
Yeah, a lot of that logic and those states are completely unnecessary
until we have a synchronous mode. Even then, it seems complex.
Here's what I've been hacking:
First of all, walreceiver no longer tries to retry the connection on
error, and postmaster no longer tries to relaunch it if it dies. So when
walreceiver is launched, it tries to connect once, and if successful,
streams until an error occurs or it's killed.
When the startup process needs more WAL to continue replay, the logic is:

while (<need more WAL>)
    if (<walreceiver is alive>)
        <wait for WAL to arrive, or for walreceiver to die>
    else if (<restore_command succeeded>)
        <continue replay from the restored file>
    else
        <start walreceiver>; <sleep 5 seconds before retrying>
So there are just two states:
1. Recovering from archive
2. Streaming from primary
We start from 1, and switch states on error.
This gives nice behavior from a user point of view. The standby tries to
make progress using either the archive or streaming, whichever becomes
available first.
Attached is a WIP patch implementing that, also available in the
'replication-xlogrefactor' branch in my git repository. It includes the
Read/FetchRecord refactoring I mentioned earlier; that's a prerequisite
for this. The code implementing the above retry logic is in
XLogReadPage(), in xlog.c.