From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Craig Ringer <craig(at)2ndquadrant(dot)com>, Greg Stark <stark(at)mit(dot)edu> |
Cc: | PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: silent data loss with ext4 / all current versions |
Date: | 2015-11-29 14:33:31 |
Message-ID: | 565B0CBB.4090406@2ndquadrant.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
On 11/29/2015 02:38 PM, Craig Ringer wrote:
> On 27 November 2015 at 21:28, Greg Stark <stark(at)mit(dot)edu
> <mailto:stark(at)mit(dot)edu>> wrote:
>
> On Fri, Nov 27, 2015 at 11:17 AM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com <mailto:tomas(dot)vondra(at)2ndquadrant(dot)com>>
> wrote:
> > I plan to do more power failure testing soon, with more complex test
> > scenarios. I suspect there might be other similar issues (e.g. when we
> > rename a file before a checkpoint and don't fsync the directory - then the
> > rename won't be replayed and will be lost).
>
> I'm curious how you're doing this testing. The easiest way I can think
> of would be to run a database on an LVM volume and take a large number
> of LVM snapshots very rapidly and then see if the database can start
> up from each snapshot. Bonus points for keeping track of the committed
> transactions before each snaphsot and ensuring they're still there I
> guess.
>
>
> I've had a few tries at implementing a qemu-based crashtester where it
> hard kills the qemu instance at a random point then starts it back up.
I've tried to reproduce the issue by killing a qemu VM, and so far I've
been unsuccessful. On bare HW it was easily reproducible (I'd hit the
issue 9 out of 10 attempts), so either I'm doing something wrong or qemu
somehow interacts with the I/O.
> I always got stuck on the validation part - actually ensuring that the
> DB state is how we expect. I think I could probably get that right now,
> it's been a while.
Weel, I guess we can't really check all the details, but I guess the
checksums make checking the general consistency somewhat simpler. And
then you have to design the workload in a way that makes the check
easier - for example remembering the committed values etc.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From | Date | Subject | |
---|---|---|---|
Next Message | Tomas Vondra | 2015-11-29 14:43:28 | Re: silent data loss with ext4 / all current versions |
Previous Message | Craig Ringer | 2015-11-29 13:48:51 | Re: How to add and use a static library within Postgres backend |