Re: pg_rewind in contrib

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: hlinnaka(at)iki(dot)fi
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Satoshi Nagayasu <snaga(at)uptime(dot)jp>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Michael Paquier <mpaquier(at)vmware(dot)com>
Subject: Re: pg_rewind in contrib
Date: 2015-03-11 04:01:58
Message-ID: CAA4eK1KWFdJSv_KqURd-zhETzhnzm9ESHRNgQAJYExVrdaaMDg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 11, 2015 at 3:44 AM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:

> On 03/10/2015 07:46 AM, Amit Kapila wrote:
>
>>
>> Isn't it possible incase of async replication that old cluster has
>> some blocks which new cluster doesn't have, what will it do
>> in such a case?
>>
>
> Sure, that's certainly possible. If the source cluster doesn't have some
> blocks that exist in the target, IOW a file in the source cluster is
> shorter than the same file in the target, that means that the relation was
> truncated in the source.

Can't that also happen if the source database (new master) hasn't
received all of the data from the target database (old master) at the
time of promotion?
If so, the source database won't have WAL for a truncation, and
the way the current mechanism works is a must.

Now I think that for such a case doing the truncation in the target
database is the right solution. However, should we warn the user in some
way (either by mentioning it in the docs or in the pg_rewind utility after
it does the truncation) that some of the data that belongs to the old
master will be overwritten by this operation, so that they can keep a
backup copy of it if they want?
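
To make the case concrete, here is a rough sketch (in Python for brevity,
not pg_rewind's actual C code) of deciding what to do with one relation
file from the two file sizes alone, plus the kind of warning I have in
mind. The names decide_action and truncation_warning, and the wording of
the warning, are made up for illustration.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"            # sizes match; nothing to do based on size alone
    COPY_TAIL = "copy_tail"  # source is longer; copy the extra tail from source
    TRUNCATE = "truncate"    # source is shorter; cut target down to source size


def decide_action(source_size: int, target_size: int) -> Action:
    """Pick an action for one relation file by comparing sizes in bytes."""
    if source_size < target_size:
        return Action.TRUNCATE
    if source_size > target_size:
        return Action.COPY_TAIL
    return Action.NONE


def truncation_warning(path: str, source_size: int, target_size: int) -> str:
    """Hypothetical warning text shown before the target file is truncated."""
    lost = target_size - source_size
    return (
        f'warning: truncating "{path}" from {target_size} to {source_size} '
        f"bytes; {lost} bytes present only on the old master will be discarded"
    )
```

Note that the size comparison cannot tell a real truncation on the source
apart from blocks the source never received, which is why a warning (or a
note in the docs) seems useful.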

>> I have tried to test some form of such a case and it seems to be
>> failing with below error:
>>
>> pg_rewind.exe -D ..\..\Data\ --source-pgdata=..\..\Database1
>> The servers diverged at WAL position 0/16DE858 on timeline 1.
>> Rewinding from last common checkpoint at 0/16B8A70 on timeline 1
>>
>> could not open file "..\..\Data\/base/12706/16391" for truncation: No such
>> file
>> or directory
>> Failure, exiting
>>
>
> Hmm, could that be just because of the funny business with the Windows
> path separators? Does it work if you use "-D ..\..\Data" instead, without
> the last backslash?
>
>
I have tried without the trailing backslash as well, but it still
returns the same error.

pg_rewind.exe -D ..\..\Data --source-pgdata=..\..\Database1
The servers diverged at WAL position 0/1769BD8 on timeline 5.
Rewinding from last common checkpoint at 0/1769B30 on timeline 5

could not open file "..\..\Data/base/12706/16394" for truncation: No such
file or directory
Failure, exiting

I have even tried with the complete path:
pg_rewind.exe -D E:\WorkSpace\PostgreSQL\master\Data
--source-pgdata=E:\WorkSpace\PostgreSQL\master\Database1
The servers diverged at WAL position 0/1782830 on timeline 6.
Rewinding from last common checkpoint at 0/1782788 on timeline 6

could not open file "E:\WorkSpace\PostgreSQL\master\Data/base/12706/16395"
for truncation: No such file or directory
Failure, exiting

Another point is that after the above error, the target database
gets corrupted. Basically, the target database contains
extra data from the source database along with part of its own data.
I think that's because the truncation didn't happen.

On retry it gives the message below:
pg_rewind.exe -D ..\..\Data --source-pgdata=..\..\Database1

source and target cluster are on the same timeline
Failure, exiting

I think the message displayed in this case is okay; however,
displaying it as 'Failure' looks slightly odd.
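
One way to read this suggestion, as a small sketch (Python, purely
illustrative): treat "nothing to rewind" as its own outcome instead of
routing it through the generic failure path. The Outcome names, the
"No rewind required" and "Done!" strings, and the exit statuses are
assumptions of mine, not what pg_rewind does today.

```python
from enum import Enum
from typing import Tuple


class Outcome(Enum):
    REWOUND = "rewound"              # target was rewound successfully
    NOTHING_TO_DO = "nothing_to_do"  # e.g. both clusters on the same timeline
    ERROR = "error"                  # a real failure, e.g. a file could not be opened


def final_message(outcome: Outcome) -> Tuple[str, int]:
    """Return the (closing line, exit status) pair for a finished run."""
    if outcome is Outcome.ERROR:
        return ("Failure, exiting", 1)
    if outcome is Outcome.NOTHING_TO_DO:
        # Hypothetical wording: informational, not reported as a failure.
        return ("No rewind required", 0)
    return ("Done!", 0)
```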

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
