Re: When pg_rewind success, the database can't startup

From: hemin <min(dot)he(at)ww-it(dot)cn>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: <pgsql-bugs(at)postgresql(dot)org>, 朱贤文 <tony(dot)zhu(at)ww-it(dot)cn>
Subject: Re: When pg_rewind success, the database can't startup
Date: 2018-06-19 07:56:15
Message-ID: 2454B24E-4FAD-4E90-B9D1-800519B84B18@ww-it.cn
Lists: pgsql-bugs

Thanks for your reply.

On Thu, Jun 14, 2018 at 05:30:20PM +0800, hemin wrote:

There is a primary/standby cluster with asynchronous replication. While a large amount of data is being inserted into the primary node, we stop the database by hand.

How do you stop it?

[hemin]: pg_ctl -D $PGDATA stop

Then we promote the standby node to be the new primary node and insert new data into it. Finally, we use pg_rewind to resolve the WAL divergence. pg_rewind succeeds, but the node cannot start up and fails with the following error:

That looks like a correct flow, roughly. Did you issue a manual checkpoint on the promoted standby before running pg_rewind? That's necessary to avoid confusing pg_rewind, which uses the on-disk data of the source's control file for sanity checks.

[hemin]: I did not run a checkpoint before running pg_rewind, because the checkpoint of the rewound primary node is obviously behind that of the promoted standby. But I will try it later.
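
For reference, the order of operations suggested above can be sketched as a list of commands. This is only a sketch under assumptions: the data directory path and connection string are hypothetical placeholders, the standby is assumed to be promoted already, and the function only builds the command lines without running them.

```python
import shlex


def rewind_plan(target_pgdata: str, source_conninfo: str) -> list[list[str]]:
    """Build the commands, in order, for rewinding the old primary.

    target_pgdata   -- data directory of the old primary (hypothetical path)
    source_conninfo -- connection string of the promoted standby (hypothetical)
    """
    return [
        # Cleanly stop the old primary, which is the rewind target.
        ["pg_ctl", "-D", target_pgdata, "stop", "-m", "fast"],
        # Force a checkpoint on the promoted standby so that its on-disk
        # control file reflects the new timeline before pg_rewind reads it.
        ["psql", "-d", source_conninfo, "-c", "CHECKPOINT"],
        # Rewind the old primary against the promoted standby.
        [
            "pg_rewind",
            f"--target-pgdata={target_pgdata}",
            f"--source-server={source_conninfo}",
        ],
    ]


if __name__ == "__main__":
    for argv in rewind_plan("/path/to/old_primary", "host=new-primary port=5432 dbname=postgres"):
        print(shlex.join(argv))
```

The checkpoint is placed right before pg_rewind because that is the step the reply above identifies as missing from the reported procedure.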

2018-06-06 14:40:18.686 CST [2687] FATAL: requested timeline 3 does not contain minimum recovery point 0/DB35BE80 on timeline 1

This means that the instance used for recovery is not part of the timeline you are trying to link to. In short, the timeline history of your nodes may have been messed up.
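
One way to read this message is as a mismatch between the timeline recorded next to the minimum recovery point and the timeline that the history file assigns to that LSN. The following is a simplified sketch of such a lookup, not the server's actual code. The history contents and switch points below are hypothetical; only the LSNs 0/DB35BE80 and 0/AEEE9460 come from this thread.

```python
def parse_lsn(text: str) -> int:
    """Convert an LSN written as 'X/Y' (two hex numbers) to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def parse_history(text: str) -> list[tuple[int, int]]:
    """Parse timeline history lines of the form 'parentTLI switchLSN reason'."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        tli, switch_lsn = line.split()[:2]
        entries.append((int(tli), parse_lsn(switch_lsn)))
    return entries


def timeline_of(lsn: int, history: list[tuple[int, int]], target_tli: int) -> int:
    """Return the timeline that owns `lsn` according to the target's history."""
    for tli, switch in history:  # entries are in ascending timeline order
        if lsn < switch:
            return tli
    return target_tli  # past every switch point: the target timeline itself


# Hypothetical contents of 00000003.history (switch points are invented).
HISTORY_3 = """\
1\t0/AEEE9498\tno recovery target specified
2\t0/AEEE9530\tno recovery target specified
"""

history = parse_history(HISTORY_3)
owner = timeline_of(parse_lsn("0/DB35BE80"), history, target_tli=3)
print("history says timeline", owner, "but the control file says timeline 1")
```

Under these assumed switch points, the minimum recovery point lies past the fork, so the history attributes it to timeline 3 while the control file still labels it as timeline 1.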

[hemin]: All WAL files exist. We can see the minimum recovery point using pg_controldata, and it is the latest LSN on the promoted standby node's timeline 3.

I think postgres wants to redo WAL from the common checkpoint 0/AEEE9460 on timeline 1 up to the minimum recovery point 0/DB35BE80 on timeline 3, and then reports the above error.
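
To make that comparison concrete, here is a minimal sketch that pulls the relevant fields out of pg_controldata output and measures how much WAL lies between the two positions. The sample text is an assumption put together from the values quoted in this thread, not real output from the affected node, and the pairing of 0/AEEE9460 with the "Latest checkpoint location" field is only illustrative.

```python
def parse_controldata(output: str) -> dict[str, str]:
    """Split 'label: value' lines of pg_controldata output into a dict."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")  # split at the first colon only
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def parse_lsn(text: str) -> int:
    """Convert an LSN written as 'X/Y' (two hex numbers) to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


# Illustrative sample only; values are taken from the messages in this thread.
SAMPLE = """\
Latest checkpoint location:           0/AEEE9460
Minimum recovery ending location:     0/DB35BE80
Min recovery ending loc's timeline:   1
"""

fields = parse_controldata(SAMPLE)
start = parse_lsn(fields["Latest checkpoint location"])
end = parse_lsn(fields["Minimum recovery ending location"])
print(f"WAL to replay: {end - start} bytes, "
      f"ending on timeline {fields[chr(77) + 'in recovery ending loc' + chr(39) + 's timeline']}")
```

In practice the input would come from running pg_controldata against the rewound data directory instead of the sample string.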

(4) Standby Node: promote the standby node to be primary:

Here you should issue a checkpoint manually on the promoted standby.

[hemin]: I will try it.

(5) Standby Node: insert 3,000,000 rows of data into the database using pgbench:

You should also be careful that the previous master, also known as the instance which has been rewound and that you are trying to plug back into the cluster, also needs the WAL segments going back to the last checkpoint before WAL forked onto its new timeline.

[hemin]: After pg_rewind succeeded and startup failed, the WAL files in pg_wal on the primary node and on the standby node were the same.
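
When checking which segments are present in pg_wal, it can help to map an LSN to the segment file name it falls in. The sketch below assumes the default 16 MB WAL segment size; a cluster initialized with a different size would need the `segment_size` argument changed.

```python
def wal_segment_name(timeline: int, lsn: str, segment_size: int = 16 * 1024 * 1024) -> str:
    """Return the WAL segment file name containing `lsn` on `timeline`.

    The name is three 8-digit hex fields: the timeline ID, then the segment
    number split into a high part and a low part.
    """
    hi, lo = lsn.split("/")
    position = (int(hi, 16) << 32) | int(lo, 16)
    segno = position // segment_size
    segments_per_id = 0x100000000 // segment_size  # 256 for 16 MB segments
    return f"{timeline:08X}{segno // segments_per_id:08X}{segno % segments_per_id:08X}"


# LSNs and timelines quoted in this thread.
print(wal_segment_name(1, "0/AEEE9460"))
print(wal_segment_name(3, "0/DB35BE80"))
```

The resulting names can be compared against a directory listing of pg_wal on each node to confirm that every segment between the common checkpoint and the minimum recovery point is actually available.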

Which version of Postgres is that? 9.5? Because if that's the case, pg_rewind in 9.5 is very primitive in the way it handles timeline jumps, and 9.6 got way smarter.

[hemin]: Both 9.6 and 10 have the problem. It is very easy to reproduce with the steps I provided.

--

Michael

何敏

Call: 185.0821.2027 | Fax: 028.6143.1877 | Web: w3.ww-it.cn

成都文武信息技术有限公司|ChengDu WenWu Information Technology Inc.|WwIT

Address: Room 611, Building 7, Zone B, Tianfu Software Park, Chengdu High-tech Zone | Postal code: 610041
