Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

From: Christophe Pettus <xof(at)thebuild(dot)com>
To: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-18 18:58:26
Message-ID: 0E76EE0A-1740-41B5-88FF-54AA98794532@thebuild.com
Lists: pgsql-hackers

Three times in the last two weeks, we have experienced data corruption on secondary servers using streaming replication on client systems. The versions involved are 9.0.14, 9.2.5, and 9.3.1. Each incident was separate; in two cases they involved the same client (9.0.14 and 9.3.1), and in one case a different client (9.2.5).

The details of each incident are similar, but not identical. Specifically:

INCIDENT #1: 9.0.14 -- A new secondary (S1) was initialized using rsync off of an existing, correct primary (P1) for the base backup, and using WAL-E for WAL segment shipping. Both the primary and secondary were running 9.0.14. S1 properly connected to the primary once it was caught up on WAL segments, and S1 was then promoted to primary using the trigger file.
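For reference, the standby side of this setup looked roughly like the following 9.0-era recovery.conf; the connection string, restore command, and trigger path below are placeholders, not the actual values used:

```conf
# Sketch of a streaming-replication standby with trigger-file promotion.
# All values are placeholder assumptions.
standby_mode = 'on'
primary_conninfo = 'host=p1.example.com port=5432 user=replication'
restore_command = 'wal-e wal-fetch "%f" "%p"'   # WAL-E segment fetch
trigger_file = '/tmp/postgresql.trigger'        # touch this file to promote
```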

No errors in the log files on either system.

After promotion, it was discovered that there was significant data loss on S1. Rows that were present on P1 were missing on S1, and some rows were duplicated (including duplicates that violated primary key and other unique constraints). The indexes were corrupt, in that they seemed to think that the duplicates were not duplicated, and that the missing rows were still present.
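For anyone checking for this kind of corruption themselves: because the indexes were themselves wrong, the duplicates only surface if the planner is kept away from them. A sketch of the check, with a hypothetical table and key name:

```sql
-- Force sequential scans so the (corrupt) unique index is not consulted.
SET enable_indexscan = off;
SET enable_bitmapscan = off;

-- Hypothetical table "accounts" with primary key "id".
SELECT id, count(*)
FROM accounts
GROUP BY id
HAVING count(*) > 1;   -- any rows returned violate the PK as stored on disk
```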

Because the client's schema included a "last_updated" field, we were able to determine that all of the rows that were either missing or duplicated had been updated on P1 shortly (3-5 minutes) before S1 was promoted. It's possible, but not confirmed, that there were active queries (including updates) running on P1 at the moment of S1's promotion.
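The forensic step above amounts to running something like the following on both P1 and S1 and diffing the output; the table name and the promotion timestamp are assumptions for illustration:

```sql
-- Run identically on the old primary and the promoted standby,
-- then compare the id lists to find missing/duplicated rows.
SELECT id, last_updated
FROM accounts
WHERE last_updated >= TIMESTAMP '2013-11-18 10:55:00'  -- ~5 min pre-promotion
ORDER BY id;
```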

As a note, P1 was created from another system (let's call it P0) using just WAL shipping (no streaming replication), and no data corruption was observed.

P1 and S1 were both AWS instances running Ubuntu 12.04, using EBS (with xfs as the file system) as the data volume.

P1 and S1 have been destroyed at this point.

INCIDENT #2: 9.3.1 -- In order to repair the database, a pg_dump was taken of S1, after having dropped the primary key and unique constraints, and restored into a new 9.3.1 server, P2. Duplicate rows were purged, and missing rows were added back. This new primary was then put back into production, and ran without incident.

A new secondary, S2 was created off of the primary. This secondary was created using pg_basebackup using --xlog-method=stream, although the WAL-E archiving was still present.
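The standby-creation step was roughly the following pg_basebackup invocation; the host, user, and data directory are placeholders (--xlog-method=stream is the 9.3 spelling of the option):

```shell
# Placeholder host/user/paths. Streams WAL over a second replication
# connection while the base backup is being taken.
pg_basebackup -h p2.example.com -U replication \
    -D /var/lib/postgresql/9.3/main \
    --xlog-method=stream --checkpoint=fast --progress --verbose
```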

S2 attached to P2 without incident and with no errors in the logs, but nearly-identical corruption was discovered, although this time there were no duplicated rows, only missing ones. At this point, it's not clear whether the rows missing from S2 show any clustering in their "last_updated" timestamps.

P2 and S2 are both AWS instances running Ubuntu 12.04, using EBS (with xfs as the file system) as the data volume.

No errors in the log files on either system.

P2 and S2 are still operational.

INCIDENT #3: 9.2.5 -- A client was migrating a large database from a 9.2.2 system (P3) to a new 9.2.5 system (S3) using streaming replication. As I personally didn't do the steps on this one, I don't have quite as much information, but the basics are close to incident #2: When S3 was promoted using the trigger file, no errors were observed and the database came up normally, but rows were missing from S3 that were present on P3.

P3 is running CentOS 6.3 with ext4 as the file system.

S3 is running CentOS 6.4 with ext3 as the file system.

Log shipping in this case was done via rsync.

P3 and S3 are still operational.

No errors in the log files on either system.

--

Obviously, we're very concerned that a bug was introduced in the latest minor release. We're happy to gather data as required to assist in diagnosing this.
--
Christophe Pettus
xof(at)thebuild(dot)com
