The dangers of streaming across versions of glibc: A cautionary tale

From: Matthew Kelly <mkelly(at)tripadvisor(dot)com>
To: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Cc: Matthew Spilich <mspilich(at)tripadvisor(dot)com>
Subject: The dangers of streaming across versions of glibc: A cautionary tale
Date: 2014-08-06 21:24:17
Message-ID: BA6132ED-1F6B-4A0B-AC22-81278F5AB81E@tripadvisor.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

The following is a real critical problem that we ran into here at TripAdvisor, but have yet figured out a clear way to mitigate.

TL;DR:
Streaming replicas—and by extension, base backups—can become dangerously broken when the source and target machines run slightly different versions of glibc. Particularly, differences in strcoll and strcoll_l leave "corrupt" indexes on the slave. These indexes are sorted out of order with respect to the strcoll running on the slave. Because postgres is unaware of the discrepancy is uses these "corrupt" indexes to perform merge joins; merges rely heavily on the assumption that the indexes are sorted and this causes all the results of the join past the first poison pill entry to not be returned. Additionally, if the slave becomes master, the "corrupt" indexes will in cases be unable to enforce uniqueness, but quietly allow duplicate values.

Context:
We were doing a hardware upgrade on a large internal machine a couple months ago. We followed a common procedure here: stand up a the new HA pair as streaming replica's of the old system; then failover to the new pair. All systems involved were running 9.1.9 (though that is not relevant as we'll see), and built from source.

Immediately, after the failover we saw some weird cases with some small indexes. We thought it was because the streaming replication failover had gone poorly (and because we weren't running latest version of postgres on that machine), so we rebuilt them and moved on. Until last week when an important query stopped getting optimized as a hash join and turned into a merge join. From that query I generated a simple, single column join between two tables. That query returns 50 million rows with merge joins disabled and 0 rows with them enabled. Rebuilding the index fixed the issue, but this was an important table and so we did some digging.

Using some query optimizer coercion, I was able to show that 1. the "corrupt" index had the same number of rows as the table, and 2. the index returned rows in a different, but nearly identical ordering to the one that you would receive by explicitly sorting the column. Taking a pair of rows that were out of place, I manage to narrow the issue down. Luckily, we able to find the old server sitting on the floor. The simplest form of this issue is:

SELECT 'M' > 'ஐ';

Root cause:
Depending on your charset the first character might look like an ascii 'M'. It is not. The two characters in question are the utf8 representations of http://www.fileformat.info/info/unicode/char/ff2d/index.htm and http://www.fileformat.info/info/unicode/char/0b90/index.htm respectively. Across different machines, running the same version of postgres, and in databases with identical character encodings and collations ('en_US.UTF-8') that select will return different results if the version of glibc is different. This holds whether one pg instance is a pg_basebackup of the other or if you run an initdb and then immediately start postgres.

Digging further lead me to: master:src/backend/utils/adt/varlena.c:1494,1497 These are the lines where postgres calls strcoll_l and strcoll, in order to sort strings in a locale aware manner.

In the simplest case, the attached c file returns inconsistent results across glibc versions and environments. It just sets the collation to 'en_US.UTF-8', and then compares two one character strings (the characters above). Depending on the version of glibc you are running, you may see positive, negative or zero as the output. I have observed:

Old Server (CentOS 5.8, kernel 2.6.18-308.24.1.el5):
0 -> glibc-2.5-81.el5_8.7
1 -> glibc-devel-2.5-81.el5_8.7 statically linked
0 -> Source builds of glibc (2.5, 2.6, 2.10)

New Server (CentOS 6.4, kernel 2.6.32-358.el6.x86_64)
-1 -> glibc-2.12-1.132.el6.x86_64

Dev server (CentOS 6.4, kernel 2.6.32-358.el6.x86_64)
-1 -> glibc-2.12-1.107.el6_4.5.x86_64
-1 -> Source build (2.12.2, 2.16, 2.18)
15 -> Source build (HEAD)

Laptop (Ubuntu, kernel 3.11.0-12-generic)
-1 -> 2.17-93ubuntu4
15 -> 2.17-93ubuntu4 statically linked

Mac OS (For comparison only)
62365 -> OSX 10.8, 10.9

From my digging, I have been unable to figure out why glibc is returning different results in different situations. It is probably worth getting a discussion going on their mailing lists as well.

Regardless, the reality is that there are different versions of glibc out there in the wild, and they do not sort consistently across versions/environments. Streaming replica's rely on the assumption that the sort order within a collation is consistent across machines. If they differ ever so slightly, then a single pair of rows with poison pill characters can yield an entire index invalid. We were lucky that the first discrepancy was at the beginning of the index. If it was 85% of the way through then we probably would never have noticed that merge joins were broken on that machine.

We still are discussing internally how we plan to prevent this in the future, but we'd like to open this up to the community for wider discussion.

- Matt K, TripAdvisor

Attachment Content-Type Size
localetest.c application/octet-stream 387 bytes

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Gregory Taylor 2014-08-06 21:28:09 Recursive CTE trees + Sorting by votes
Previous Message David Johnston 2014-08-06 18:39:38 Re: Re: understanding why two nearly identical queries take two different planner routes, one 5s and one 2hr