We have already started a discussion on pgsql-hackers about the problem of
taking a fresh backup during the failback operation; here is the link for that thread:
Let me again summarize the problem we are trying to address.
When the master fails, the last few WAL files may not reach the standby. But
the master may have gone ahead and made changes to its local file system
after flushing WAL to local storage. So the master contains some file
system level changes that the standby does not have, and at this point the
data directory of the master is ahead of the standby's data directory.
Subsequently, the standby is promoted as the new master. Later, when the
old master wants to become a standby of the new master, it cannot simply
rejoin the setup, because the two servers are now inconsistent; a fresh
backup must be taken from the new master. This can happen with both
synchronous and asynchronous replication.
A fresh backup is also needed in the case of a clean switch-over, because in
the current HEAD the master does not wait for the standby to receive all the
WAL up to the shutdown checkpoint record before closing the connection.
Fujii Masao has already submitted a patch to handle the clean switch-over
case, but the problem remains for the failback case.
Taking a fresh backup is very time consuming when databases are large, say
several TBs, and when the servers are connected over a relatively slow
link. This can break the service level agreement of the disaster recovery
system, so there is a need to improve the disaster recovery process in
PostgreSQL. One way to achieve this is to maintain consistency between
master and standby, which avoids the need for a fresh backup.
So our proposal for this problem is to ensure that the master does not make
any file system level change without confirming that the corresponding WAL
record has been replicated to the standby.
There were many suggestions and objections on pgsql-hackers about this
problem. A brief summary follows:
1. The main objection, raised by Tom and others, was that we should not
add this feature and should stick with the traditional way of taking a fresh
backup using rsync, because of the additional complexity of the patch and
the performance overhead during normal operation.
2. Tom and others were also worried about inconsistencies in the crashed
master and suggested that it is better to start with a fresh backup.
Fujii Masao and others countered this, pointing out that we already trust
WAL recovery to clear all such inconsistencies, and there is no reason why
we can't do the same here.
3. Someone suggested using rsync with checksums, but many pages on the
two servers may differ in their binary contents because of hint bits etc.
4. The major objection to the failback-without-fresh-backup idea was that it
may introduce performance overhead and add complexity to the code. Having
looked at the patch, I must say it is not too complex. For the performance
impact, I tested the patch with pgbench; the results show a very small
overhead. Please refer to the test results included at the end of this mail.
*Proposal to solve the problem*
The proposal is based on the concept that the master should not make any
file system level change until the corresponding WAL record has been
replicated to the standby.
There are many places in the code that need to be handled to support the
proposed solution. The following cases explain why a fresh backup is needed
at failover time, and how our approach avoids that need.
1. We must not write any heap pages to disk before the WAL records
corresponding to those changes are received by the standby. Otherwise, if
the standby fails to receive the WAL corresponding to those heap pages,
there will be an inconsistency between the two data directories.
2. When a CHECKPOINT happens on the master, the master's control file gets
updated with the last checkpoint record. Suppose failover happens and the
standby fails to receive the WAL record corresponding to that CHECKPOINT;
then master and standby have inconsistent copies of the control file, which
leads to a mismatch in the redo record, and recovery will not start
normally. To avoid this situation we must not update the master's control
file before the corresponding checkpoint WAL record is received by the
standby.
3. Similarly, when we truncate a physical file on the master and the
standby fails to receive the corresponding WAL, that physical file is
truncated on the master but still present on the standby, causing an
inconsistency. To avoid this we must not truncate physical files on the
master before the WAL record corresponding to that operation is received by
the standby.
4. The same applies to CLOG pages. If a CLOG page is written to disk but the
corresponding WAL record is not replicated to the standby, that also leads
to inconsistency. So we must not write CLOG pages (and maybe other SLRU
pages too) to disk before the corresponding WAL records are received by the
standby.
5. The same problem applies to commit hint bits, but it is more complicated
than the other cases, because no WAL records are generated for hint bit
updates, so we cannot simply wait for a corresponding WAL record to be
replicated to the standby. Instead we delay updating the commit hint bits,
similar to what is done for asynchronous commits. In other words, we check
whether the WAL corresponding to the transaction commit has been received
by the failback safe standby, and only then allow the hint bit update.
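The common pattern in the cases above can be sketched as a single gate (a
minimal illustration with hypothetical names, not the patch's actual code):
before a page flush, control file update, or truncation, the master compares
the operation's WAL position with the position the standby has confirmed.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* WAL location, as in PostgreSQL */

/* Highest WAL position confirmed received by the failback safe standby;
 * in the real patch this information comes from walsender feedback.
 * (Hypothetical variable name, for illustration only.) */
XLogRecPtr failback_safe_lsn = 0;

/* Cases 1-4: a heap page flush, control file update, file truncation,
 * or SLRU page write stamped with WAL position op_lsn is safe only
 * once the standby has received WAL up to that point. */
bool
filesystem_change_is_safe(XLogRecPtr op_lsn)
{
    return op_lsn <= failback_safe_lsn;
}

/* Case 5: hint bits generate no WAL, so instead of waiting we skip
 * the update unless the transaction's commit record has already
 * reached the standby; the bit can be set on a later access. */
bool
can_set_commit_hint_bit(XLogRecPtr commit_lsn)
{
    return commit_lsn <= failback_safe_lsn;
}
```

The difference between the two checks is what happens on failure: cases 1-4
block until the standby catches up, while case 5 simply skips the update.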
The initial work on this patch was done by Pavan Deolasee. I tested it and
will make further enhancements based on community feedback.
This patch is not complete yet, but I plan to complete it with the help of
this community. At this point, the primary purpose is to understand the
complexities and get some initial performance numbers to alleviate some of
the concerns raised by the community.
There are two GUC parameters which support this failsafe standby feature:
1. failback_safe_standby_name [ name of the failsafe standby ]. The master
will not make any file system level change before the corresponding WAL is
replicated to this failsafe standby.
2. failback_safe_standby_mode [ off/remote_write/remote_flush ]. This
parameter specifies the behavior of the master, i.e. whether it should wait
for the WAL to be written on the standby or flushed on the standby. It
should be turned off when the failsafe standby is not wanted. The failsafe
mode can be combined with both synchronous and asynchronous streaming
replication.
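Under these assumptions, enabling the feature in postgresql.conf would look
roughly like this (an illustrative fragment; the standby name 'standby1' is
hypothetical and would presumably have to match the standby's connection
name, as with synchronous_standby_names):

```
# Name of the failback safe standby
failback_safe_standby_name = 'standby1'

# off | remote_write | remote_flush
failback_safe_standby_mode = 'remote_flush'
```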
Most of the changes are in syncrep.c. That is a slight misnomer, because the
file deals with synchronous standbys while a failback standby could, and
most likely would, be an asynchronous standby. But keeping the changes this
way has made the patch easy to read. Once we have acceptance of the
approach, the patch can be modified to reorganize the code in a more
appropriate way.
The patch adds a new state, SYNC_REP_WAITING_FOR_FAILBACK_SAFETY, to the
sync standby states. A backend that is waiting for the failback safe standby
to receive WAL records waits in this state. The failback safe mechanism can
work in two different modes, waiting either for the WAL to be written or for
it to be flushed on the failsafe standby; these are represented by two new
modes, SYNC_REP_WAIT_FAILBACK_SAFE_WRITE and SYNC_REP_WAIT_FAILBACK_SAFE_FLUSH.
Also, SyncRepWaitForLSN() is changed to support a conditional wait, so that
we can delay hint bit updates on the master instead of blocking until the
failback safe standby receives the WAL.
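The conditional wait can be sketched as follows (a simplification with
hypothetical names, not the actual patch: the real code sleeps on the
backend's latch until the walsender advances the standby's position, whereas
this sketch just pretends the standby catches up immediately):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Positions reported by the failback safe standby (hypothetical names). */
XLogRecPtr standby_write_lsn = 0;   /* WAL written on the standby */
XLogRecPtr standby_flush_lsn = 0;   /* WAL flushed on the standby */

typedef enum
{
    FAILBACK_SAFE_WAIT_WRITE,   /* cf. SYNC_REP_WAIT_FAILBACK_SAFE_WRITE */
    FAILBACK_SAFE_WAIT_FLUSH    /* cf. SYNC_REP_WAIT_FAILBACK_SAFE_FLUSH */
} FailbackSafeMode;

typedef enum { WAIT_DONE, WAIT_WOULD_BLOCK } WaitResult;

/*
 * Wait until the failback safe standby has received WAL up to 'lsn'.
 * With nowait = true (the conditional wait used for hint bit updates)
 * we return WAIT_WOULD_BLOCK instead of sleeping, and the caller skips
 * the update rather than stalling.
 */
WaitResult
failback_safe_wait_for_lsn(XLogRecPtr lsn, FailbackSafeMode mode, bool nowait)
{
    XLogRecPtr done = (mode == FAILBACK_SAFE_WAIT_WRITE)
                      ? standby_write_lsn
                      : standby_flush_lsn;

    if (lsn <= done)
        return WAIT_DONE;       /* standby already has this WAL */
    if (nowait)
        return WAIT_WOULD_BLOCK;

    /* Real code: sleep on a latch until woken by the walsender.
     * Sketch: assume the standby catches up immediately. */
    if (mode == FAILBACK_SAFE_WAIT_WRITE)
        standby_write_lsn = lsn;
    else
        standby_flush_lsn = lsn;
    return WAIT_DONE;
}
```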
*PostgreSQL versions:* PostgreSQL 9.3beta1
*Usage:* For operating in failsafe mode you need to configure the two GUC
parameters described above.
The tests were performed on servers with 32 GB RAM, with checkpoint_timeout
set to 10 minutes so that multiple checkpoints occur during each 30-minute
run. A checkpoint flushes all dirty blocks to disk, and we wanted to
primarily test that code path.
Transaction type: TPC-B
Scaling factor: 100
Query mode: simple
Number of clients: 100
Number of threads: 1
Duration: 1800 s
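The settings above correspond roughly to a pgbench invocation like this
(illustrative only; database name, host, and port options are assumptions):

```
pgbench -i -s 100 bench                      # initialize, scaling factor 100
pgbench -c 100 -j 1 -M simple -T 1800 bench  # 100 clients, 1 thread, 1800 s
```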
The following table shows the average TPS measured for each scenario; we
conducted 3 runs per scenario:
1) Synchronous Replication - 947 tps
2) Synchronous Replication + Failsafe standby (off) - 934 tps
3) Synchronous Replication + Failsafe standby (remote_flush) - 931 tps
4) Asynchronous Replication - 1369 tps
5) Asynchronous Replication + Failsafe standby (off) - 1349 tps
6) Asynchronous Replication + Failsafe standby (remote_flush) - 1350 tps
From these numbers we can conclude the following:
1. Streaming replication + failback safe (remote_flush):
   a) On average, synchronous replication combined with the failsafe standby
   (remote_flush) shows a 1.68% performance overhead.
   b) On average, asynchronous streaming replication combined with the
   failsafe standby (remote_flush) shows a 1.38% performance degradation.
2. Streaming replication + failback safe (turned off):
   a) On average, synchronous replication combined with the failsafe standby
   (off) shows a 1.37% performance overhead.
   b) On average, asynchronous streaming replication combined with the
   failsafe standby (off) shows a 1.46% performance degradation.
So the patch is showing 1-2% performance overhead.
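For reference, the overhead figures can be recomputed directly from the
averaged tps values in the table (small differences from the quoted
percentages, e.g. 1.69 vs 1.68, presumably come from averaging per-run
percentages rather than the averaged tps):

```c
/* Percent slowdown of the failsafe configuration relative to the
 * corresponding baseline, computed from average tps values. */
double
overhead_pct(double base_tps, double failsafe_tps)
{
    return 100.0 * (base_tps - failsafe_tps) / base_tps;
}
```

For example, overhead_pct(947, 931) gives about 1.69 for synchronous
replication with remote_flush, and overhead_pct(1369, 1350) about 1.39 for
the asynchronous case, consistent with the 1-2% range above.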
Please give your suggestions if there is a need to run tests for other
scenarios.
1. Currently this patch supports only one failback safe standby, which can
be either a synchronous or an asynchronous standby. We probably need to
discuss whether to support multiple failsafe standbys.
2. The current design waits forever for the failback safe standby.
Streaming replication has the same limitation. We probably need to discuss
whether this should be changed.
There are a couple more places that probably need some attention; I have
marked them with XXX.