Re: pg_rewind in contrib

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Satoshi Nagayasu <snaga(at)uptime(dot)jp>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Michael Paquier <mpaquier(at)vmware(dot)com>
Subject: Re: pg_rewind in contrib
Date: 2015-03-11 08:53:00
Lists: pgsql-hackers

On 03/11/2015 05:01 AM, Amit Kapila wrote:
> On Wed, Mar 11, 2015 at 3:44 AM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>> On 03/10/2015 07:46 AM, Amit Kapila wrote:
>>> Isn't it possible in case of async replication that the old cluster has
>>> some blocks which the new cluster doesn't have? What will it do
>>> in such a case?
>> Sure, that's certainly possible. If the source cluster doesn't have some
>> blocks that exist in the target, IOW a file in the source cluster is
>> shorter than the same file in the target, that means that the relation was
>> truncated in the source.
> Can't that happen if the source database (new-master) hasn't
> received all of the data from the target database (old-master) at the
> time of promotion?
> If yes, then the source database won't have WAL for the truncation, and
> the way the current mechanism works is a must.
> Now I think for such a case doing truncation in the target database
> is the right solution,

Yeah, that can happen, and truncation is the correct fix for it. The
logic is pretty well explained by this comment in filemap.c:

* It's a data file that exists in both.
*
* If it's larger in target, we can truncate it. There will
* also be a WAL record of the truncation in the source
* system, so WAL replay would eventually truncate the target
* too, but we might as well do it now.
*
* If it's smaller in the target, it means that it has been
* truncated in the target, or enlarged in the source, or
* both. If it was truncated locally, we need to copy the
* missing tail from the remote system. If it was enlarged in
* the remote system, there will be WAL records in the remote
* system for the new blocks, so we wouldn't need to copy them
* here. But we don't know which scenario we're dealing with,
* and there's no harm in copying the missing blocks now, so
* do it now.
*
* If it's the same size, do nothing here. Any locally
* modified blocks will be copied based on parsing the local
* WAL, and any remotely modified blocks will be updated after
* rewinding, when the remote WAL is replayed.
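
As a rough illustration of the three cases in that comment (this is not the actual filemap.c code; the enum and function names below are made up for the example), the size comparison could be sketched like this:

```c
#include <stddef.h>

/* Hypothetical action names, for illustration only. */
typedef enum
{
	ACTION_NONE,		/* same size: nothing to do at this stage */
	ACTION_TRUNCATE,	/* target is larger: truncate it to the source size */
	ACTION_COPY_TAIL	/* target is smaller: copy the missing tail from source */
} size_action;

/*
 * Decide what to do with a data file that exists in both clusters,
 * looking only at the two file sizes (in bytes).
 */
static size_action
decide_size_action(size_t source_size, size_t target_size)
{
	if (target_size > source_size)
		return ACTION_TRUNCATE;
	if (target_size < source_size)
		return ACTION_COPY_TAIL;
	return ACTION_NONE;
}
```

With 8 kB blocks, a one-block source file and a two-block target file would give ACTION_TRUNCATE, which is the case discussed above. Blocks modified in place are not covered by this check; per the comment, those are found by parsing the WAL.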

> however should we warn the user in some way
> (either by mentioning it in the docs or in the pg_rewind utility after
> it does the truncation) that some of the data that belongs to the old-master
> will be overwritten by this operation, so that he can keep a
> backup copy of it if he wants.

Well, pg_rewind *always* overwrites any transactions that were committed
in the old master but not streamed to the standby. That's the whole
point. You're probably right that we should stress that more in the
docs, though.

I've been thinking that it would be nice to print out a list of such
transactions that pg_rewind is going to overwrite. The admin could look
at the (hopefully short) list and perhaps try to fix up the data
manually afterwards, to recover the lost transactions. Perhaps we could
use the logical decoding stuff for that, to print out the transactions
in a human-readable format. But that's a TODO for the future.

>>> I have tried to test some form of such a case and it seems to be
>>> failing with below error:
>>> pg_rewind.exe -D ..\..\Data\ --source-pgdata=..\..\Database1
>>> The servers diverged at WAL position 0/16DE858 on timeline 1.
>>> Rewinding from last common checkpoint at 0/16B8A70 on timeline 1
>>> could not open file "..\..\Data\/base/12706/16391" for truncation: No such
>>> file or directory
>>> Failure, exiting
>> Hmm, could that be just because of the funny business with the Windows
>> path separators? Does it work if you use "-D ..\..\Data" instead, without
>> the last backslash?
> I have tried without backslash as well, but still it returns
> same error.
> pg_rewind.exe -D ..\..\Data --source-pgdata=..\..\Database1
> The servers diverged at WAL position 0/1769BD8 on timeline 5.
> Rewinding from last common checkpoint at 0/1769B30 on timeline 5
> could not open file "..\..\Data/base/12706/16394" for truncation: No such
> file or directory
> Failure, exiting

I tried to reproduce this, but it tripped the "Assert(entry->isrelfile)"
assertion in process_block_change. However, that seems to be an
unrelated issue - pg_rewind was not handling FSM blocks correctly. It's
supposed to ignore them but extractPageInfo didn't get the memo. I think
I broke that when doing the changes for the new WAL record format.
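
To make the intended behaviour concrete, here is a minimal sketch of the kind of check involved. It is not the patch itself: the enum and function names are hypothetical stand-ins rather than the real PostgreSQL definitions, and how forks other than the FSM are treated is left out because it is not discussed here.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for relation fork identifiers. */
typedef enum
{
	FORK_MAIN,
	FORK_FSM
} fork_kind;

/*
 * Return true if a block reference found while scanning the WAL should
 * be ignored rather than recorded as a changed block. FSM blocks are
 * the ones that are supposed to be skipped.
 */
static bool
is_ignored_block(fork_kind fork)
{
	return fork == FORK_FSM;
}
```

A WAL-scanning loop would call such a check before recording a block change, so that FSM blocks never reach the code that expects a relation data file entry.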

After fixing that (new patch attached), your test case works fine for
me. I'm using the attached bash script to test it. Can you test if the
attached script works for you, and if it does, see if you can "fix" the
script so that it reproduces the error you're seeing?

> Another point is that after the above error, the target database
> gets corrupted. Basically the target database contains
> some extra data from the source database and part of its own data.
> I think that's because the truncation didn't happen.

Yeah, if something goes wrong during pg_rewind, the target database is
toast. Taking a backup and using --dry-run are highly recommended ;-).
As a safety feature, perhaps pg_rewind should temporarily rename the
control file or something like that, so that if it's interrupted, the
target database will refuse to start up. Then again, that would make it
more difficult to do forensics or disaster recovery on the database, if
you didn't take a backup.
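
For what it's worth, the renaming idea could look roughly like the sketch below. This is purely hypothetical: pg_rewind does not do this, the helper names are invented, and the caller would have to pick both paths.

```c
#include <stdio.h>

/*
 * Hypothetical helper: move the control file out of the way before the
 * rewind starts, so that an interrupted rewind leaves a cluster that
 * cannot be started by accident. Returns 0 on success, -1 on failure.
 */
static int
hide_control_file(const char *control_path, const char *hidden_path)
{
	return rename(control_path, hidden_path) == 0 ? 0 : -1;
}

/*
 * Hypothetical helper: put the control file back once the rewind has
 * completed successfully. Returns 0 on success, -1 on failure.
 */
static int
restore_control_file(const char *control_path, const char *hidden_path)
{
	return rename(hidden_path, control_path) == 0 ? 0 : -1;
}
```

The trade-off is the one mentioned above: anyone doing forensics on a half-rewound cluster would first have to know to rename the file back.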

> On retry it gives below message:
> pg_rewind.exe -D ..\..\Data --source-pgdata=..\..\Database1
> source and target cluster are on the same timeline
> Failure, exiting
> I think message displayed in this case is okay, however
> displaying it as 'Failure' looks slightly odd.

Hmm. In other similar scenarios, pg_rewind will return Success with
message "No rewind required". Yeah, we should probably do that in this
case too.

- Heikki

Attachment Content-Type Size
pg_rewind-bin-8.patch.gz application/gzip 30.0 KB application/x-shellscript 2.0 KB
