pgsql: Fix bugs in cascading replication with recovery_target_timeline=

From: Heikki Linnakangas <heikki(dot)linnakangas(at)iki(dot)fi>
To: pgsql-committers(at)postgresql(dot)org
Subject: pgsql: Fix bugs in cascading replication with recovery_target_timeline=
Date: 2012-09-05 02:34:43
Message-ID: E1T95Rz-0000a6-1N@gemulon.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-committers

Fix bugs in cascading replication with recovery_target_timeline='latest'

The cascading replication code assumed that the current RecoveryTargetTLI
never changes, but that's not true with recovery_target_timeline='latest'.
The obvious upshot of that is that RecoveryTargetTLI in shared memory needs
to be protected by a lock. A less obvious consequence is that when a
cascading standby is connected, and the standby switches to a new target
timeline after scanning the archive, it will continue to stream WAL to the
cascading standby, but from a wrong file, ie. the file of the previous
timeline. For example, if the standby is currently streaming from the middle
of file 000000010000000000000005, and the timeline changes, the standby
will continue to stream from that file. However, the WAL on the new
timeline is in file 000000020000000000000005, so the standby sends garbage
from 000000010000000000000005 to the cascading standby, instead of the
correct WAL from file 000000020000000000000005.

This also fixes a related bug where a partial WAL segment is restored from
the archive and streamed to a cascading standby. The code assumed that when
a WAL segment is copied from the archive, it can immediately be fully
streamed to a cascading standby. However, if the segment is only partially
filled, ie. has the right size, but only N first bytes contain valid WAL,
that's not safe. That can happen if a partial WAL segment is manually copied
to the archive, or if a partial WAL segment is archived because a server is
started up on a new timeline within that segment. The cascading standby will
get confused if the WAL it received is not valid, and will get stuck until
it's restarted. This patch fixes that problem by not allowing WAL restored
from the archive to be streamed to a cascading standby until it's been
replayed, and thus validated.

Branch
------
REL9_2_STABLE

Details
-------
http://git.postgresql.org/pg/commitdiff/af35e66f04fe7e0430046e0bc83d8d117e95ba2f

Modified Files
--------------
src/backend/access/transam/xlog.c | 63 +++++++++++++++++------------------
src/backend/replication/walsender.c | 28 ++++++++++++++-
src/include/access/xlog.h | 4 +-
3 files changed, 59 insertions(+), 36 deletions(-)

Browse pgsql-committers by date

  From Date Subject
Next Message Bruce Momjian 2012-09-05 04:01:26 pgsql: In pg_upgrade, document why we can't issue \n\n in the command l
Previous Message Kevin Grittner 2012-09-05 02:14:54 pgsql: Fix serializable mode with index-only scans.