On Fri, Mar 23, 2012 at 11:03:27PM +0900, Fujii Masao wrote:
> > On Wed, Feb 29, 2012 at 5:48 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> >> In streaming replication, after failover, the new master might have lots of
> >> un-applied WAL files with the old timeline ID. These are WAL files which were
> >> recycled as future ones while the server was running as a standby. Since they
> >> will never be used later, they don't need to be archived after failover. But
> >> since they have neither a .ready nor a .done file in archive_status,
> >> checkpoints after failover newly create .ready files for them, and then
> >> finally they are archived, which might cause a disk I/O spike on both the
> >> WAL and archive storage.
If the old master archived later WAL that the new master never restored, won't
this attempt to archive a file under a name that already exists in the
archive? The documentation says this:
    The archive command should generally be designed to refuse to overwrite any
    pre-existing archive file. This is an important safety feature to preserve
    the integrity of your archive in case of administrator error (such as
    sending the output of two different servers to the same archive directory).
    It is advisable to test your proposed archive command to ensure that it
    indeed does not overwrite an existing file, and that it returns nonzero
    status in this case.
Archiving on the new master would halt until the operator intervenes.
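For concreteness, the no-overwrite behavior the documentation asks for can be exercised with the stock test-then-copy pattern it suggests; the directory, segment name, and file contents below are placeholders:

```shell
# Sketch of the documented archive_command pattern, e.g.
#   archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'
# Once the file exists in the archive, the command returns nonzero,
# and archiving halts until the operator intervenes.
archive_dir=$(mktemp -d)
segment=$(mktemp)
echo "segment contents" > "$segment"

# First attempt: the archive file does not exist yet, so it is copied.
test ! -f "$archive_dir/000000010000000000000001" && \
    cp "$segment" "$archive_dir/000000010000000000000001"
echo "first attempt: $?"

# Second attempt: simulates re-archiving a segment the old master
# already archived; the test fails and the command exits nonzero.
test ! -f "$archive_dir/000000010000000000000001" && \
    cp "$segment" "$archive_dir/000000010000000000000001"
echo "second attempt: $?"

rm -rf "$archive_dir" "$segment"
```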
> >> To avoid the above problem, I think that un-applied WAL files with the old
> >> timeline ID should be marked as already-archived and recycled immediately
> >> at the end of recovery. Thoughts?
A small hazard comes to mind. If the administrator manually copied
post-timeline-divergence segments from the failed master to the new master's
pg_xlog, the current implementation loads them into the archive for you. The
new master could never apply those files locally, but they might be useful for
alternate recoveries down the previous timeline. Nonetheless, we can just as
reasonably specify that it's not a role of the new master to provide this
service, and call the fact that it did so in previous releases an
implementation detail.

What about instead creating an archive status file at recycle time and
deleting it as we begin to populate the file? That distinguishes copied-in,
unarchived segments from recycled ones.
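To sketch what I mean (the ".recycled" marker name is made up, and the real bookkeeping would of course live in xlog.c, not shell):

```shell
# Hypothetical lifecycle: a marker is created when a segment is
# recycled as a future one, and removed when the segment begins to
# be populated with real WAL.  At end of recovery, only segments
# still carrying the marker are known to be empty recycled files.
xlog=$(mktemp -d)               # stands in for pg_xlog
mkdir "$xlog/archive_status"

# 1. Segment recycled as a future one: marker created alongside it.
: > "$xlog/000000010000000000000005"
: > "$xlog/archive_status/000000010000000000000005.recycled"

# 2. Segment copied in by the administrator: no marker.
: > "$xlog/000000010000000000000009"

# 3. End-of-recovery cleanup can now tell the two apart.
for seg in "$xlog"/0000000100000000000000*; do
    name=$(basename "$seg")
    if [ -e "$xlog/archive_status/$name.recycled" ]; then
        echo "$name: recycled, safe to drop without archiving"
    else
        echo "$name: copied in, preserve for archiving"
    fi
done
rm -rf "$xlog"
```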
Incidentally, RemoveOldXlogFiles() has this comment:
* We ignore the timeline part of the XLOG segment identifiers in
* deciding whether a segment is still needed. This ensures that we
* won't prematurely remove a segment from a parent timeline. We could
* probably be a little more proactive about removing segments of
* non-parent timelines, but that would be a whole lot more
Should both instances of "parent" be "child" or "descendant"?
> Just after failover, there can be three kinds of WAL files in new
> master's pg_xlog directory:
> (1) WAL files which were recycled by a restartpoint
> I've already explained upthread the issue which these WAL files cause
> after failover.
> (2) WAL files which were restored from the archive
> In 9.1 or before, restored WAL files don't remain after failover
> because they are always restored onto the temporary file name
> "RECOVERYXLOG", so the issue I describe below doesn't exist in 9.1
> or before.
> In 9.2dev, as a result of supporting cascading replication, an
> archived WAL file is restored onto its correct file name so that a
> cascading walsender can send it to another standby. This restored
The documentation still says this:
    WAL segments that cannot be found in the archive will be sought in pg_xlog/;
    this allows use of recent un-archived segments. However, segments that are
    available from the archive will be used in preference to files in
    pg_xlog/. The system will not overwrite the existing contents of pg_xlog/
    when retrieving archived files.
I gather the last sentence is now false?
> WAL file has neither a .ready nor a .done archive status file. After
> failover, checkpoint checks the archive status of the restored WAL
> file in order to recycle it, finds that it has neither .ready nor
> .done, and creates .ready. Because the .ready file exists, the
> segment will be archived again even though it obviously already
> exists in the archival storage :(
> To prevent a restored WAL file from being archived again, I think
> that .done should be created whenever WAL file is successfully
> restored (of course this should happen only when archive_mode is
> enabled). Thoughts?
Your proposed fix makes sense, and I cannot think of any disadvantage.
Concerning only doing it when archive_mode=on, would there ever be a case
where a segment is restored under archive_mode=off, then the server restarted
with archive_mode=on and an archival attempted on that segment?
> (3) WAL files which were streamed from the master
> These WAL files also don't have any archive status, so checkpoint
> creates .ready for them after failover. And then all or many of
> them will be archived at once, which would cause an I/O spike on
> both the WAL and archival storage.
> To avoid this problem, I think that we should change walreceiver
> so that it creates .ready as soon as it completes the WAL file. Also
> we should change the archiver process so that it starts up even in
> standby mode and archives the WAL files.
> If each server has its own archival storage, the above solution would
> work fine. But if all servers share the archival storage, multiple archiver
> processes in those servers might archive the same WAL file to
> the shared area at the same time. Is this OK? If not, to avoid this,
> we might need to separate archive_mode into two settings: one for
> normal mode (i.e., master), another for standby mode. If the archive
> is shared, we can ensure that only the archiver on the master copies
> a given WAL file by disabling WAL archiving in standby mode but
> enabling it in normal mode. Thoughts?
I don't think we should remove the recommendation to make archive_command fail
when the archive already has the file. However, the new master is likely to
have at least one segment not appearing in the archive along with some
already-archived segments. There's certainly a use case for completing the
shared archive with local-only segments. I think this also ties into the
prerequisites for letting former peers of the new master begin to follow the
new master without fresh base backups.
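One conceivable shape for a shared-archive archive_command is copy-then-link: link(2) fails if the target already exists, so two archivers racing on the same segment cannot clobber each other. The script name, paths, and helper below are all hypothetical:

```shell
# Hypothetical race-safe archive step for a shared archive, e.g.
#   archive_command = 'archive_once.sh %p %f'   (name is made up)
# Copy to a private temporary name first, then hard-link into place;
# ln fails with nonzero status if the target already exists.
archive_dir=$(mktemp -d)
src=$(mktemp); echo data > "$src"

archive_once() {
    tmp="$archive_dir/.$2.tmp.$$"
    cp "$1" "$tmp" &&
    ln "$tmp" "$archive_dir/$2" 2>/dev/null
    status=$?
    rm -f "$tmp"          # temp copy is no longer needed either way
    return $status
}

archive_once "$src" seg1; echo "first: $?"
archive_once "$src" seg1; echo "second: $?"
rm -rf "$archive_dir" "$src"
```

The losing archiver still reports failure and will retry, so this only removes the clobbering race; it doesn't answer the question above of which server should be archiving at all.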
More thought is needed here.