Re: silent data loss with ext4 / all current versions

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>, Greg Stark <stark(at)mit(dot)edu>
Cc: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: silent data loss with ext4 / all current versions
Date: 2015-11-29 14:33:31
Message-ID: 565B0CBB.4090406@2ndquadrant.com
Lists: pgsql-hackers

Hi,

On 11/29/2015 02:38 PM, Craig Ringer wrote:
> On 27 November 2015 at 21:28, Greg Stark <stark(at)mit(dot)edu> wrote:
>
> On Fri, Nov 27, 2015 at 11:17 AM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> > I plan to do more power failure testing soon, with more complex test
> > scenarios. I suspect there might be other similar issues (e.g. when we
> > rename a file before a checkpoint and don't fsync the directory - then the
> > rename won't be replayed and will be lost).
>
> I'm curious how you're doing this testing. The easiest way I can think
> of would be to run a database on an LVM volume and take a large number
> of LVM snapshots very rapidly and then see if the database can start
> up from each snapshot. Bonus points for keeping track of the committed
> transactions before each snapshot and ensuring they're still there I
> guess.
>
>
> I've had a few tries at implementing a qemu-based crashtester where it
> hard kills the qemu instance at a random point then starts it back up.

I've tried to reproduce the issue by killing a qemu VM, and so far I've
been unsuccessful. On bare HW it was easily reproducible (I'd hit the
issue in 9 out of 10 attempts), so either I'm doing something wrong or
qemu somehow interacts with the I/O.
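
For reference, the failure mode discussed in this thread is a rename whose
directory entry never reaches disk because the containing directory was not
fsynced. Below is a minimal sketch of the general "rename durably" pattern,
assuming Linux semantics where a directory can be opened read-only and
fsynced. The helper names fsync_dir and durable_rename are hypothetical and
are not taken from the PostgreSQL source.

```python
import os


def fsync_dir(path):
    """fsync a directory so changes to its entries (e.g. a rename) reach disk."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def durable_rename(src, dst):
    """fsync src, rename it to dst, then fsync the affected directories.

    Hypothetical helper for illustration; assumes Linux/POSIX behaviour.
    """
    # Flush the file contents first, so the new name never refers to
    # data that has not been written out yet.
    fd = os.open(src, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)

    os.rename(src, dst)

    # Without this step the rename may exist only in memory, and a power
    # failure can leave the directory with the old entry.
    src_dir = os.path.dirname(os.path.abspath(src))
    dst_dir = os.path.dirname(os.path.abspath(dst))
    fsync_dir(dst_dir)
    if src_dir != dst_dir:
        fsync_dir(src_dir)
```

A power-failure test would perform the plain os.rename() without the final
directory fsync, cut power, and check which name survives.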

> I always got stuck on the validation part - actually ensuring that the
> DB state is how we expect. I think I could probably get that right now,
> it's been a while.

Well, we can't really check all the details, but I guess the
checksums make checking the general consistency somewhat simpler. And
then you have to design the workload in a way that makes the check
easier - for example by remembering the committed values etc.
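
One way to read "remembering the committed values" is sketched below. It
assumes an insert-only workload with unique ids, and a ledger of
acknowledged commits that is kept somewhere that survives the crash (for
example on the machine driving the test). The function name
find_lost_commits and the dict-based representation are hypothetical,
chosen only for illustration.

```python
def find_lost_commits(acknowledged, recovered):
    """Return acknowledged rows that are missing or altered after recovery.

    acknowledged: dict of id -> payload for every commit the client saw
                  confirmed before the crash (the external ledger).
    recovered:    dict of id -> payload read back from the database after
                  crash recovery.

    Rows present in `recovered` but not in `acknowledged` are allowed: a
    commit may have become durable just before the crash without the
    acknowledgement reaching the client.
    """
    lost = {}
    for row_id, payload in acknowledged.items():
        found = recovered.get(row_id)
        if found != payload:
            # Map id -> (expected payload, payload actually found or None).
            lost[row_id] = (payload, found)
    return lost


acknowledged = {1: "a", 2: "b", 3: "c"}
recovered = {1: "a", 3: "c", 4: "d"}
print(find_lost_commits(acknowledged, recovered))  # {2: ('b', None)}
```

An empty result means that every acknowledged commit survived. It does not
prove that the rest of the database is consistent, which is where the
checksums come in.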

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
