Re: Fwd: Data corruption after restarting replica

From: dinesh kumar <dineshkumar02(at)gmail(dot)com>
To: Novák, Petr <novakp(at)avast(dot)com>
Cc: Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Fwd: Data corruption after restarting replica
Date: 2015-02-18 22:01:20
Message-ID: CALnrH7qw4Dib04JE_v0DGfQLGiSufU7-w5MAM0=Ys09JOfDSXA@mail.gmail.com
Lists: pgsql-bugs pgsql-general

Hi,

On Mon, Feb 16, 2015 at 2:44 AM, Novák, Petr <novakp(at)avast(dot)com> wrote:

> Hello,
>
> sorry for posting to a second list, but as I've received no reply
> on the first, I'm trying my luck here.
>
> Thanks
> Petr
>
>
> ---------- Forwarded message ----------
> From: Novák, Petr <novakp(at)avast(dot)com>
> Date: Tue, Feb 10, 2015 at 12:49 PM
> Subject: Data corruption after restarting replica
> To: pgsql-bugs(at)postgresql(dot)org
>
>
> Hi all,
>
> we're experiencing data corruption after switching a streaming replica
> to primary.
> This is not the first time I've encountered this issue, so I'll try to
> describe it in more detail.
>
> For this particular cluster we have 6 servers in two datacenters (3 in
> each). There are two instances running on each server, each with its
> own port and datadir. On the first two servers in each datacenter one
> instance is primary and the other is replica for the primary from the
> other server. Third server holds two offsite replicas from the other
> datacenter (for DR purposes)
>
> Each replica was set up by taking pg_basebackup from primary
> (pg_basebackup -h <hostname> -p 5430 -D /data2/basebackup -P -v -U
> <user> -x -c fast). Then directories from initdb were replaced with
> the ones from basebackup (only the configuration files remained) and
> the replica started and was successfully connected to primary. It was
> running with no problems, keeping up with the primary. We were
> experiencing some connection problems between the two datacenters, but
> replication didn't break.
>
> Then we needed to take one datacenter offline due to hardware
> maintenance. So I shut the applications down, verified that no
> more clients were connected to the primary, then shut the primary down
> and restarted the replica without recovery.conf. The applications were
> then started against the new db with no problem. The other replica even
> successfully reconnected to this new primary.
>
>
Before restarting the replica, did you make sure that all transactions
from the master had been applied on the replica?
May we also know why you restarted the replica without recovery.conf? Did
you want to keep the same timeline for the xlogs, or was there some other
specific reason?
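
One way to check the first point, sketched here under some assumptions:
after the clean shutdown, read "Latest checkpoint location" from
pg_controldata on the old primary and compare it with the result of
pg_last_xlog_replay_location() on the replica. The helper names
(parse_lsn, replica_caught_up) and the sample values below are made up
for illustration, and only the ordering of the two positions is compared,
not a byte distance.

```python
def parse_lsn(lsn):
    """Split a WAL location such as '2A/3C000028' into (log id, offset) integers."""
    log_id, offset = lsn.strip().split("/")
    return int(log_id, 16), int(offset, 16)


def replica_caught_up(primary_lsn, replica_replay_lsn):
    """Return True if the replica has replayed at least up to primary_lsn."""
    return parse_lsn(replica_replay_lsn) >= parse_lsn(primary_lsn)


# Hypothetical values: the first from pg_controldata on the stopped primary,
# the second from SELECT pg_last_xlog_replay_location() on the replica.
print(replica_caught_up("2A/3C000028", "2A/3C000090"))  # True
print(replica_caught_up("2A/3C000028", "29/FF000010"))  # False
```

If the replica's replay position is behind the primary's last checkpoint,
the replica had not yet applied everything when it was restarted.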

Regards,
Dinesh
manojadinesh.blogspot.com

> A few hours after the switch, lines appeared in the server log (which
> hadn't appeared before), indicating corruption:
>
> ERROR: index "account_username_key" contains unexpected zero page at
> block 1112135
> ERROR: right sibling's left-link doesn't match: block 476354 links to
> 1062443 instead of expected 250322 in index "account_pkey"
>
> ..and many more reporting corruption in several other indexes.
>
> The issue was resolved by creating new indexes and dropping the
> affected ones, although there were already some duplicates in the
> data that had to be resolved, as some of the indexes were unique.
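
A rough sketch of how such duplicates might be located before rebuilding
a unique index: group by the indexed column and keep the groups with more
than one row. The table and column names (account, username) are only
guessed from the index name in the error above, and duplicate_check_sql
is a hypothetical helper. Disabling index scans for the session is meant
to keep the planner away from the damaged index.

```python
def duplicate_check_sql(table, column):
    """Build SQL that lists values occurring more than once in table.column.

    Identifiers are double-quoted; this is a sketch and does not escape
    quotes embedded in the names.
    """
    return (
        "SET enable_indexscan = off;\n"
        "SET enable_bitmapscan = off;\n"
        f'SELECT "{column}", count(*) AS copies\n'
        f'FROM "{table}"\n'
        f'GROUP BY "{column}"\n'
        "HAVING count(*) > 1;"
    )


# Table and column are assumptions inferred from "account_username_key".
print(duplicate_check_sql("account", "username"))
```
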
>
> This particular case uses Postgres 9.1.14 on both primary and replica,
> but I've experienced similar behavior on 9.2.9. The OS is CentOS 6.6 in
> all cases. This may mean that there is something wrong with our
> configuration or the replication setup steps, but I've set up another
> instance using the same steps with no problem.
>
> Fsync-related settings are at their defaults. Data directories are on
> RAID10 arrays with BBUs. The filesystem is ext4, mounted with the
> nobarrier option.
>
> The database is fairly large (~120GB), with several tables of 50
> million+ rows and lots of indexes and FK constraints. It is mostly
> read; updates/inserts/deletes run at only several rows/s.
>
> Any help will be appreciated.
>
> Petr Novak
>
> System Engineer
> Avast s.r.o.
>
