Re: Why we lost Uber as a user

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Kevin Grittner <kgrittn(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Alfred Perlstein <alfred(at)freebsd(dot)org>, Geoff Winkless <pgsqladmin(at)geoff(dot)dj>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why we lost Uber as a user
Date: 2016-08-19 14:08:25
Message-ID: CAHyXU0zkmRt2hi6Y9q1MWYHCYmx4XxO5k9oS7yN6Y-Qq5vZzjg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Aug 17, 2016 at 5:18 PM, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com> wrote:
> On 8/17/16 2:51 PM, Simon Riggs wrote:
>> On 17 August 2016 at 12:19, Greg Stark <stark(at)mit(dot)edu> wrote:
>>> Yes, this is exactly what it should be doing and exactly why it's
>>> useful. Physical replication accurately replicates the data from the
>>> master including "corruption" whereas a logical replication system
>>> will not, causing divergence and possible issues during a failover.
>>
>>
>> Yay! Completely agree.
>>
>> Physical replication, as used by DRBD and all other block-level HA
>> solutions, and also used by other databases, such as Oracle.
>>
>> Corruption on the master would often cause errors that would prevent
>> writes and therefore those changes wouldn't even be made, let alone be
>> replicated.
>
>
> My experience has been that you discover corruption after it's already
> safely on disk, and more than once I've been able to recover by using data
> on a londiste replica.
>
> As I said originally, it's critical to understand the different solutions
> and the pros and cons of each. There is no magic bullet.

Data point: in the half or so cases I've experienced corruption on
replicated systems, in all cases but one the standby was clean. The
'unclean' case actually 8.2 warm standby; the source of the corruption
was a very significant bug where prepared statements would write back
corrupted data if the table definitions changed under the statement
(fixed in 8.3). In that particular case the corruption was very
unfortunately quite widespread and passed directly along to the
standby server. This bug nearly costed us a user as well although not
nearly so famous as uber :-).

In the few modern cases I've seen I've not been able to trace it back
to any bug in postgres (in particular multixact was ruled out) and
I've chalked it up to media or (more likely I think) filesystem
problems in the face of a -9 reset.

merlin

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Joshua Bay 2016-08-19 14:16:35 Re: Most efficient way for libPQ .. PGresult serialization
Previous Message Kevin Grittner 2016-08-19 14:00:15 Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location