Re: Fwd: Data corruption after restarting replica

From: Novák, Petr <novakp(at)avast(dot)com>
To: dinesh kumar <dineshkumar02(at)gmail(dot)com>
Cc: Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Fwd: Data corruption after restarting replica
Date: 2015-02-19 14:36:40
Message-ID: CA+eEC0rQ2zcLEETkJGh_YfB_Ey5J6C=Sn_qWurbvbT5MWkwvjA@mail.gmail.com
Lists: pgsql-bugs pgsql-general

Hi Dinesh

On Wed, Feb 18, 2015 at 11:01 PM, dinesh kumar <dineshkumar02(at)gmail(dot)com> wrote:
> Hi,
>
> On Mon, Feb 16, 2015 at 2:44 AM, Novák, Petr <novakp(at)avast(dot)com> wrote:
>>
>> Hello,
>>
>> sorry for posting to a second list, but as I've received no reply
>> there, I'm trying my luck here.
>>
>> Thanks
>> Petr
>>
>>
>> ---------- Forwarded message ----------
>> From: Novák, Petr <novakp(at)avast(dot)com>
>> Date: Tue, Feb 10, 2015 at 12:49 PM
>> Subject: Data corruption after restarting replica
>> To: pgsql-bugs(at)postgresql(dot)org
>>
>>
>> Hi all,
>>
>> we're experiencing data corruption after switching a streaming replica
>> to primary.
>> This is not the first time I've encountered this issue, so I'll try to
>> describe it in more detail.
>>
>> For this particular cluster we have 6 servers in two datacenters (3 in
>> each). There are two instances running on each server, each with its
>> own port and datadir. On the first two servers in each datacenter, one
>> instance is a primary and the other is a replica of the primary on the
>> other server. The third server holds two offsite replicas from the
>> other datacenter (for DR purposes).
>>
>> Each replica was set up by taking pg_basebackup from primary
>> (pg_basebackup -h <hostname> -p 5430 -D /data2/basebackup -P -v -U
>> <user> -x -c fast). Then directories from initdb were replaced with
>> the ones from basebackup (only the configuration files remained) and
>> the replica started and was successfully connected to primary. It was
>> running with no problem keeping up with the primary. We were
>> experiencing some connection problem between the two datacenters, but
>> replication didn't break.
>>
>> Then we needed to take one datacenter offline due to hardware
>> maintenance. So I shut the applications down, verified that no
>> more clients were connected to the primary, then shut the primary down
>> and restarted the replica without recovery.conf. The applications were
>> started using the new db with no problem. The other replica even
>> successfully reconnected to this new primary.
>>
>
> Before restarting the replica, did you make sure that all master
> transactions had been applied to the replication node?

Yes.
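
For reference, a check of this kind can be sketched as a comparison of
two xlog locations. This is only a minimal sketch: it assumes the values
were read with pg_current_xlog_location() on the primary and
pg_last_xlog_replay_location() on the replica, and the helper names
below are hypothetical. A clean shutdown of the primary writes a final
checkpoint record, so a position read before the shutdown is only a
lower bound.

```python
def parse_xlog_location(text):
    """Split a location such as '2F/A1000020' into (xlogid, offset) integers."""
    xlogid, offset = text.strip().split("/")
    return int(xlogid, 16), int(offset, 16)


def replica_caught_up(primary_location, replica_replay_location):
    """Return True when the replica has replayed at least up to the primary's position."""
    # Tuples compare element by element, so the xlogid part is compared first.
    replayed = parse_xlog_location(replica_replay_location)
    written = parse_xlog_location(primary_location)
    return replayed >= written
```

The locations are compared as (xlogid, offset) pairs rather than
converted to byte counts, which avoids depending on the segment layout
of a particular server version.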

> May we know why you restarted the replica without recovery.conf? Do you
> want to maintain the same timeline for the xlogs, or are there any other
> specific reasons?
>

Exactly, to preserve the xlog timeline.

> Regards,
> Dinesh
> manojadinesh.blogspot.com
>
>>
>> A few hours after the switch, lines that had not appeared before
>> showed up in the server log, indicating corruption:
>>
>> ERROR: index "account_username_key" contains unexpected zero page at
>> block 1112135
>> ERROR: right sibling's left-link doesn't match: block 476354 links to
>> 1062443 instead of expected 250322 in index "account_pkey"
>>
>> ..and many more reporting corruption in several other indexes.
>>
>> The issue was resolved by creating new indexes and dropping the
>> affected ones, although there were already some duplicates in the
>> data that had to be resolved, as some of the indexes were unique.
>>
>> This particular case uses Postgres 9.1.14 on both primary and replica,
>> but I've experienced similar behavior on 9.2.9. The OS is CentOS 6.6 in
>> all cases. This may mean that there is something wrong with our
>> configuration or the replication setup steps, but I've set up another
>> instance using the same steps with no problem.
>>
>> Fsync-related settings are at their defaults. Data directories are on
>> RAID10 arrays with BBUs. The filesystem is ext4, mounted with the
>> nobarrier option.
>>
>> Database is fairly large ~120GB with several 50mil+ tables, lots of
>> indexes and FK constraints. It is mostly queried,
>> updates/inserts/deletes are only several rows/s.
>>
>> Any help will be appreciated.
>>
>> Petr Novak
>>
>> System Engineer
>> Avast s.r.o.
>>
>>
>
>
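
The names of the affected indexes can be collected from the server log
with a short script. The following is only a rough sketch: it matches
the two error messages quoted above, assumes each log entry sits on a
single line, and the function name is hypothetical.

```python
import re

# Phrases taken from the two error messages quoted in this thread.
CORRUPTION_HINTS = (
    "contains unexpected zero page",
    "left-link doesn't match",
)
INDEX_NAME = re.compile(r'index "([^"]+)"')


def corrupted_indexes(log_lines):
    """Return the sorted names of indexes mentioned in corruption errors."""
    names = set()
    for line in log_lines:
        if "ERROR:" not in line:
            continue
        if any(hint in line for hint in CORRUPTION_HINTS):
            names.update(INDEX_NAME.findall(line))
    return sorted(names)
```

Passing an open log file to corrupted_indexes() yields the list of
indexes to recreate. Other corruption messages would need their own
entries in CORRUPTION_HINTS.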
