Re: silent data loss with ext4 / all current versions

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: silent data loss with ext4 / all current versions
Date: 2016-01-23 15:01:24
Message-ID: CAB7nPqQ+wHynKARTXP63SBv9oqbGxOz0Godx99YL=QGHY4FPbA@mail.gmail.com
Lists: pgsql-hackers

On Sat, Jan 23, 2016 at 11:39 AM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> On 01/23/2016 02:35 AM, Michael Paquier wrote:
>>
>> On Fri, Jan 22, 2016 at 9:41 PM, Greg Stark <stark(at)mit(dot)edu> wrote:
>>> On Fri, Jan 22, 2016 at 8:26 AM, Tomas Vondra
>>> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>> LVM snapshots would have the advantage that you can keep running the
>>> database and you can take lots of snapshots with relatively little
>>> overhead. Having dozens or hundreds of snapshots would be an unacceptable
>>> performance drain in production but for testing it should be practical
>>> and they take relatively little space -- just the blocks changed since
>>> the snapshot was taken.
>>
>>
>> Another idea: hardcode a PANIC just after rename() with
>> restart_after_crash = off (this needs IsBootstrapProcess() checks).
>> Once server crashes, kill-9 the VM. Then restart the VM and the
>> Postgres instance with a new binary that does not have the PANIC, and
>> see how things are moving on. There is a window of up to several
>> seconds after the rename() call, so I guess that this would work.
>
>
> I don't see how that would improve anything, as the PANIC has no impact on
> the I/O requests already issued to the system. What you need is some sort of
> coordination between the database and the script that kills the VM (or takes
> a LVM snapshot).

Well, to emulate the effect that non-renamed files have on the system,
we could simply emulate the loss of rename() by commenting it out and
then forcibly crashing the instance, or by making the instance PANIC
just before rename(). This would emulate what we are looking for, no?
What we want to check is how the system reacts should an unwanted file
be in place.
For example, take the rename() call in InstallXLogFileSegment():
crashing with a non-effective rename() leaves an annoying xlogtemp
file behind. Making the rename persistent would make the server
complain about an invalid magic number in a segment that has just been
created.
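
To make the idea concrete, here is a minimal standalone sketch of that
kind of fault injection. It is not PostgreSQL code: install_segment(),
file_exists() and the rename_fault switch are hypothetical names made up
for illustration, and skipping the rename() only approximates what a
crash just before it (or a rename lost by the file system) would leave
behind.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical fault-injection switch, not a PostgreSQL setting. */
enum rename_fault { FAULT_NONE, FAULT_SKIP_RENAME };

/*
 * Write "data" to tmp_path, fsync it, then rename it to final_path.
 * With FAULT_SKIP_RENAME the rename() is skipped, leaving the temporary
 * file in place, as a crash just before rename() would.
 * Returns 0 on success, -1 on error.
 */
int
install_segment(const char *tmp_path, const char *final_path,
                const char *data, enum rename_fault fault)
{
    size_t len = strlen(data);
    int    fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;

    if (write(fd, data, len) != (ssize_t) len || fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;

    /* A real test would abort() here to mimic the PANIC. */
    if (fault == FAULT_SKIP_RENAME)
        return 0;

    return rename(tmp_path, final_path);
}

/* Returns 1 if path exists, 0 otherwise. */
int
file_exists(const char *path)
{
    return access(path, F_OK) == 0;
}
```

With FAULT_SKIP_RENAME the temporary file exists and the final one does
not, which is the state the server would have to cope with at restart.
Note that this sketch does not fsync the parent directory, so even with
FAULT_NONE the rename itself is not guaranteed to survive a power loss.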
--
Michael
