From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>
Cc: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: silent data loss with ext4 / all current versions
Date: 2016-01-23 02:39:50
Message-ID: 56A2E7F6.9090902@2ndquadrant.com
Lists: pgsql-hackers

On 01/23/2016 02:35 AM, Michael Paquier wrote:
> On Fri, Jan 22, 2016 at 9:41 PM, Greg Stark <stark(at)mit(dot)edu> wrote:
>> On Fri, Jan 22, 2016 at 8:26 AM, Tomas Vondra
>> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>> On 01/22/2016 06:45 AM, Michael Paquier wrote:
>>>
>>>> So, I have been playing with a Linux VM with VMware Fusion, and on
>>>> ext4 with data=ordered the renames are getting lost if the root
>>>> folder is not fsynced. By kill -9'ing the VM I am able to reproduce
>>>> that really easily.
>>>
>>>
>>> Yep. Same experience here (with qemu-kvm VMs).
>>
>> I still think a better approach for this is to run the database on an
>> LVM volume and take lots of snapshots. No VM needed, though it doesn't
>> hurt. LVM volumes are below the level of the filesystem and a snapshot
>> captures the state of the raw blocks the filesystem has written to the
>> block layer. The block layer does no caching though the drive may but
>> neither the VM solution nor LVM would capture that.
>>
>> LVM snapshots would have the advantage that you can keep running the
>> database and you can take lots of snapshots with relatively little
>> overhead. Having dozens or hundreds of snapshots would be unacceptable
>> performance drain in production but for testing it should be practical
>> and they take relatively little space -- just the blocks changed since
>> the snapshot was taken.
>
> Another idea: hardcode a PANIC just after rename() with
> restart_after_crash = off (this needs IsBootstrapProcess() checks).
> Once server crashes, kill-9 the VM. Then restart the VM and the
> Postgres instance with a new binary that does not have the PANIC, and
> see how things are moving on. There is a window of up to several
> seconds after the rename() call, so I guess that this would work.

I don't see how that would improve anything, as the PANIC has no impact
on the I/O requests already issued to the system. What you need is some
sort of coordination between the database and the script that kills the
VM (or takes an LVM snapshot).

That can be done by simply emitting a particular log message, and the
"kill script" may simply watch the log file (for example over SSH). This
has the benefit that you can also watch for additional conditions that
are difficult to check from that particular part of the code, and only
kill the VM when all of them trigger - for example only on the third
checkpoint since the start, and so on.
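As a rough sketch of such a kill script (the host name, log path and kill command are placeholders, not something from this thread; "checkpoint complete" is a real server log message):

```shell
# Hypothetical "kill script" helper: read Postgres log lines on stdin,
# e.g. piped from: ssh testvm tail -n0 -F /pgdata/pg_log/postgresql.log
# Print KILL (and stop reading) once the trigger pattern has appeared
# n times - e.g. only on the third checkpoint since the start. The
# caller can then hard-stop the VM, e.g. with "virsh destroy testvm".
watch_log() {
  pat=$1
  n=$2
  awk -v n="$n" -v pat="$pat" \
    'index($0, pat) { if (++seen == n) { print "KILL"; exit } }'
}

# example wiring (placeholders):
#   ssh testvm tail -n0 -F "$LOG" | watch_log "checkpoint complete" 3 \
#     && virsh destroy testvm
```

The point of the awk filter is just to encode "all conditions triggered" as process exit, so the hard kill can be chained with plain shell `&&`.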

The reason why I was not particularly thrilled about the LVM snapshot
idea is that to identify this particular data loss issue, you need to be
able to reason about the expected state of the database (which
transactions are committed, how many WAL segments exist). And my
understanding was that Greg's idea was merely "try to start the DB on a
snapshot and see if it starts / is not corrupted", which would not catch
this particular issue, as the database seemed just fine - the data loss
is silent. Adding the "last XLOG segment" into pg_controldata would make
it easier to detect, without having to track details about which
transactions got committed.
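To illustrate what such a field would enable - note that the "Last known WAL segment" line below is the proposed addition, not an existing pg_controldata field - a test harness could compare the control file's notion of the last segment against the segment it recorded just before killing the VM:

```shell
# Hypothetical check, assuming pg_controldata grew a "Last known WAL
# segment" line (this field does NOT exist today; it is the proposal
# above). Reads pg_controldata-style output on stdin and compares the
# segment it reports against the one the harness saw before the kill.
check_last_segment() {
  want=$1
  got=$(awk -F': *' '$1 == "Last known WAL segment" { print $2 }')
  if [ "$got" = "$want" ]; then
    echo "WAL intact ($got)"
  else
    echo "WAL lost: control file has '$got', harness saw '$want'"
  fi
}
```

A wrapper would then run something like `pg_controldata "$PGDATA" | check_last_segment "$expected"` after restoring the snapshot, with no need to track individual transactions.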

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
