Re: Broken hint bits (freeze)

From: Vladimir Borodin <root(at)simply(dot)name>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dmitriy Sarafannikov <dsarafannikov(at)yandex(dot)ru>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Broken hint bits (freeze)
Date: 2017-05-26 20:25:35
Message-ID: 9FD5B557-25D5-4199-92CE-C278AEBF97BB@simply.name
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


> 26 мая 2017 г., в 23:04, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> написал(а):
>
> On Fri, May 26, 2017 at 2:39 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>> Yeah, I think this is quite suspicious. This seems to indicate that
>> not all WAL records are replicated before the switchover. What is the
>> value of "synchronous_commit" you are using? I think you somehow need
>> to ensure before switchover that all the WAL is replicated to ensure
>> this is not a setup problem.
>
> It is so easy to corrupt a server because of an incorrect base backup
> flow or an incorrect switchover that it would be good first to
> understand how you are doing your switchover. Any corruption happening
> after a promotion, a failover or a switchover may be the top of the
> iceberg of what's on the data pages, and you may just see one symptom
> among other problems. Particularly, did you kill the master in any
> violent way after promoting the standby? Has the former master been
> able to perform at least once a clean shutdown checkpoint and has it
> been rewound?

At first we cleanly stop the former master and then we extra check that all data has been received by the standby to be promoted. Old master is returned by simply generating recovery.conf and starting it, all other standbys are done in the same way. WAL history on all hosts and in archive remains linear, no pg_rewind is needed. This procedure has been well tested on 9.3 (when the ability to do such switchover without restoring standbys from backup appeared), automated and has not changed for all these years.

Actually we have already found that LSNs of all corrupted tuples are somewhere near the time of upgrade from 9.5 to 9.6. There is still no evidence but it seems that it is mostly related to upgrade procedure. We now extract backups of 9.5 and 9.6 for a database where we now have corrupted pages to check this version. But it is still not obvious if it could be a pg_upgrade bug or our mistake in a way we did upgrade.

> --
> Michael

--
May the force be with you…
https://simply.name

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2017-05-26 20:47:13 Re: logical replication and PANIC during shutdown checkpoint in publisher
Previous Message Michael Paquier 2017-05-26 20:04:32 Re: Broken hint bits (freeze)