Re: BUG #15346: Replica fails to start after the crash

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Alexander Kukushkin <cyberdemn(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15346: Replica fails to start after the crash
Date: 2018-08-28 02:44:09
Message-ID: 20180828024409.GB29157@paquier.xyz
Lists: pgsql-bugs pgsql-hackers

On Sat, Aug 25, 2018 at 09:54:39AM +0200, Alexander Kukushkin wrote:
> Why the number of tuples in the xlog is greater than the number of
> tuples on the index page?
> Because this page was already overwritten and its LSN is HIGHER than
> the current LSN!

That's annoying, because it means that the control file of your
server points to a consistent point which is older than some of the
relation pages. How was the base backup of this node created? Please
remember that when taking a base backup from a standby, you should
back up the control file last, as no end-of-backup WAL record is
available there to mark where the backup ends. So it seems to me that
the origin of your problem is a base backup that was not taken the way
recovery expects?
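
As a rough illustration of the condition described above (a page LSN that is ahead of the consistent point recorded in the control file), here is a minimal Python sketch. It assumes the two LSNs have already been obtained as text, for example the page LSN from pageinspect's page_header() and the "Minimum recovery ending location" line of pg_controldata. The function names and the sample values are hypothetical and only serve the example.

```python
def parse_lsn(text: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to a 64-bit integer.

    The text form is two hexadecimal numbers: high 32 bits, a slash,
    then low 32 bits.
    """
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def page_is_ahead(page_lsn: str, min_recovery_point: str) -> bool:
    """Return True if the page was last modified after the point the
    control file considers consistent (hypothetical helper)."""
    return parse_lsn(page_lsn) > parse_lsn(min_recovery_point)


if __name__ == "__main__":
    # Made-up values, not taken from the report in this thread.
    print(page_is_ahead("0/5000028", "0/4FFFFD0"))
```

LSNs have to be compared as integers: a plain string comparison would rank "A/0" above "10/0", which is the wrong order.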

> Is there a way to recover from such a situation? Should the postgres
> in such case do comparison of LSNs and if the LSN on the page is
> higher than the current LSN simply return InvalidTransactionId?
> Apparently, if there are no connections open postgres simply is not
> running this code and it seems ok.

One idea I have would be to copy all the WAL segments up to the point
where the to-be-updated pages are, and let Postgres replay all the local
WAL first. However, it is hard to say whether that would be enough, as you
could have more references to pages even newer than the btree one you
just found.
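
To get a feeling for which WAL segments would have to be copied for the idea above, the following Python sketch lists the segment file names covering a range of LSNs. It is only an approximation under stated assumptions: the default 16 MB segment size, a single timeline, and hypothetical helper names. An LSN that falls exactly on a segment boundary is attributed to the segment that starts there.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default 16 MB segments


def parse_lsn(text: str) -> int:
    """Convert a textual LSN ('high/low' in hex) to a 64-bit integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_file_name(timeline: int, lsn: str,
                  seg_size: int = WAL_SEGMENT_SIZE) -> str:
    """Name of the WAL segment file containing the given LSN."""
    segno = parse_lsn(lsn) // seg_size
    segs_per_id = 0x100000000 // seg_size  # segments per 4 GB of WAL
    return "%08X%08X%08X" % (timeline,
                             segno // segs_per_id,
                             segno % segs_per_id)


def segments_between(timeline: int, start_lsn: str, end_lsn: str,
                     seg_size: int = WAL_SEGMENT_SIZE) -> list:
    """All segment file names from start_lsn to end_lsn, inclusive."""
    first = parse_lsn(start_lsn) // seg_size
    last = parse_lsn(end_lsn) // seg_size
    segs_per_id = 0x100000000 // seg_size
    return ["%08X%08X%08X" % (timeline, n // segs_per_id, n % segs_per_id)
            for n in range(first, last + 1)]


if __name__ == "__main__":
    # Made-up LSNs: from the replay position to the newest page LSN found.
    for name in segments_between(1, "0/FE000000", "1/1000000"):
        print(name)
```

Here the end LSN would be the highest page LSN found so far. As noted above, other pages may carry even newer LSNs, so the resulting list is a lower bound rather than a guarantee.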
--
Michael
