| From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> | 
|---|---|
| To: | Craig Ringer <craig(at)2ndquadrant(dot)com>, Greg Stark <stark(at)mit(dot)edu> | 
| Cc: | PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> | 
| Subject: | Re: silent data loss with ext4 / all current versions | 
| Date: | 2015-11-29 14:33:31 | 
| Message-ID: | 565B0CBB.4090406@2ndquadrant.com | 
| Lists: | pgsql-hackers | 
Hi,
On 11/29/2015 02:38 PM, Craig Ringer wrote:
> On 27 November 2015 at 21:28, Greg Stark <stark(at)mit(dot)edu
> <mailto:stark(at)mit(dot)edu>> wrote:
>
>     On Fri, Nov 27, 2015 at 11:17 AM, Tomas Vondra
>     <tomas(dot)vondra(at)2ndquadrant(dot)com <mailto:tomas(dot)vondra(at)2ndquadrant(dot)com>>
>     wrote:
>     > I plan to do more power failure testing soon, with more complex test
>     > scenarios. I suspect there might be other similar issues (e.g. when we
>     > rename a file before a checkpoint and don't fsync the directory - then the
>     > rename won't be replayed and will be lost).
>
>     I'm curious how you're doing this testing. The easiest way I can think
>     of would be to run a database on an LVM volume and take a large number
>     of LVM snapshots very rapidly and then see if the database can start
>     up from each snapshot. Bonus points for keeping track of the committed
>     transactions before each snapshot and ensuring they're still there I
>     guess.
>
>
> I've had a few tries at implementing a qemu-based crashtester where it
> hard kills the qemu instance at a random point then starts it back up.
I've tried to reproduce the issue by killing a qemu VM, and so far I've 
been unsuccessful. On bare HW it was easily reproducible (I'd hit the 
issue 9 out of 10 attempts), so either I'm doing something wrong or qemu 
somehow interacts with the I/O.
> I always got stuck on the validation part - actually ensuring that the
> DB state is how we expect. I think I could probably get that right now,
> it's been a while.
Well, we can't really check all the details, but I guess the 
checksums make checking the general consistency somewhat simpler. And 
then you have to design the workload in a way that makes the check 
easier - for example by remembering the committed values etc.
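
To illustrate the "remembering the committed values" idea, here is a minimal
sketch of what the check could look like. It is not taken from any actual test
harness; the function name and the dict-based representation are assumptions
made for illustration. The client would record each key/value pair only after
COMMIT has been acknowledged (and store that record somewhere that survives
the crash, e.g. on another machine), and after recovery compare it with what
the database returns.

```python
from typing import Dict, List, Optional, Tuple


def find_lost_commits(
    acknowledged: Dict[int, str],
    recovered: Dict[int, str],
) -> List[Tuple[int, str, Optional[str]]]:
    """Return (key, expected, found) for every acknowledged commit that is
    missing or holds a different value after crash recovery.

    Hypothetical helper: `acknowledged` is what the client saw committed,
    `recovered` is what the database contains after restart.
    """
    problems = []
    for key, expected in sorted(acknowledged.items()):
        found = recovered.get(key)  # None means the row is gone entirely
        if found != expected:
            problems.append((key, expected, found))
    return problems


if __name__ == "__main__":
    acknowledged = {1: "a", 2: "b", 3: "c"}
    recovered = {1: "a", 3: "x", 4: "d"}
    # Key 2 is missing and key 3 has the wrong value. Key 4 is not reported.
    print(find_lost_commits(acknowledged, recovered))
    # prints [(2, 'b', None), (3, 'c', 'x')]
```

Rows that are present after recovery but were never acknowledged (key 4
above) are deliberately not treated as errors, because a commit may become
durable just before the crash without the client ever seeing the
confirmation.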
regards
--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services