Re: BUG #15346: Replica fails to start after the crash

From: Alexander Kukushkin <cyberdemn(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15346: Replica fails to start after the crash
Date: 2018-08-28 06:21:57
Message-ID: CAFh8B=m0Bht-BfKmyzfxcivzjcqRd7BbNHeWthDveWwZ+DrV2A@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

Hi Michael,

> That's annoying. Because that means that the control file of your
> server maps to a consistent point which is older than some of the
> relation pages. How was the base backup of this node created? Please
> remember that when taking a base backup from a standby, you should
> backup the control file last, as there is no control of end backup with
> records available. So it seems to me that the origin of your problem
> comes from an incorrect base backup expectation?

We are running a cluster of 3 nodes (m4.large + EBS volume for
PGDATA) on AWS. The replicas were initialized about a year ago with
pg_basebackup and have been working absolutely fine. In the past year
I did a few minor upgrades with switchover (first upgrade the
replicas, switch over, then upgrade the former primary). The last
switchover was done on August 19th. This instance had been working as
a replica for about three days until the sudden crash of the EC2
instance. On the new instance we attached the existing EBS volume with
the existing PGDATA and tried to start postgres. You can see the
consequences in the very first email.

> One idea I have would be to copy all the WAL segments up to the point
> where the pages to-be-updated are, and let Postgres replay all the local
> WALs first. However it is hard to say if that would be enough, as you
> could have more references to pages even newer than the btree one you
> just found.

Well, I did some experiments, and among them was the approach you
suggest, i.e. I commented out restore_command in recovery.conf and
copied quite a few WAL segments to pg_xlog. The results are the same.
It aborts as long as there are connections open :(
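
For readers who want to repeat that kind of experiment, here is a minimal sketch in Python. It is not the exact set of commands that was run: the helper names (disable_restore_command, copy_wal_segments) and any paths passed to them are hypothetical, and it assumes the server is stopped and that the archived segments sit in a plain directory under their usual 24-character hexadecimal names.

```python
import re
import shutil
from pathlib import Path
from typing import List

# WAL segment file names are 24 hexadecimal characters; timeline history
# and backup label files in the archive are deliberately skipped here.
WAL_SEGMENT_RE = re.compile(r"^[0-9A-F]{24}$")


def disable_restore_command(recovery_conf: Path) -> None:
    """Comment out any active restore_command line in recovery.conf.

    Hypothetical helper: run it only while the server is stopped.
    """
    lines = recovery_conf.read_text().splitlines(keepends=True)
    patched = [
        "#" + line if line.lstrip().startswith("restore_command") else line
        for line in lines
    ]
    recovery_conf.write_text("".join(patched))


def copy_wal_segments(archive_dir: Path, pg_xlog: Path) -> List[str]:
    """Copy WAL segments from archive_dir into pg_xlog.

    Segments already present in pg_xlog are left untouched. Returns the
    names of the segments that were copied, in sorted order.
    """
    copied = []
    for segment in sorted(archive_dir.iterdir()):
        target = pg_xlog / segment.name
        if WAL_SEGMENT_RE.match(segment.name) and not target.exists():
            shutil.copy2(segment, target)
            copied.append(segment.name)
    return copied
```

With restore_command commented out, recovery can only read WAL that is already in pg_xlog, so the copy step decides how far local replay can go before streaming starts.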

Regards,
--
Alexander Kukushkin
