Re: Potential data loss of 2PC files

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: David Steele <david(at)pgmasters(dot)net>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Potential data loss of 2PC files
Date: 2017-03-17 02:26:27
Message-ID: CAFjFpRfsjKYCCetfrzuUk8MJM1hLAr=9eW9zXT4QbnMH6PwhFA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Mar 16, 2017 at 10:17 PM, David Steele <david(at)pgmasters(dot)net> wrote:
> On 2/13/17 12:10 AM, Michael Paquier wrote:
>> On Tue, Jan 31, 2017 at 11:07 AM, Michael Paquier
>> <michael(dot)paquier(at)gmail(dot)com> wrote:
>>> On Mon, Jan 30, 2017 at 10:52 PM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>>>> If that can happen, don't we have the same problem in many other places?
>>>> Like, all the SLRUs? They don't fsync the directory either.
>>>
>>> Right, pg_commit_ts and pg_clog enter in this category.
>>
>> Implemented as attached.
>>
>>>> Is unlink() guaranteed to be durable, without fsyncing the directory? If
>>>> not, then we need to fsync() the directory even if there are no files in it
>>>> at the moment, because some might've been removed earlier in the checkpoint
>>>> cycle.
>>>
>>> Hm... I am not an expert in file systems. At least on ext4 I can see
>>> that unlink() is atomic, but not durable. So if an unlink() is
>>> followed by a power failure, the previously unlinked file could be
>>> here if the parent directory is not fsync'd.
>>
>> So I have been doing more work on this patch, with the following things done:
>> - Flush pg_clog, pg_commit_ts and pg_twophase at checkpoint phase to
>> ensure their durability.
>> - Create a durable_unlink() routine to give a way to perform a durable
>> file removal.
>> I am now counting 111 calls to unlink() in the backend code, and
>> looking at all of them most look fine with plain unlink() if they are
>> not made durable as they work on temporary files (see timeline.c for
>> example), with some exceptions:
>> - In pg_stop_backup, the old backup_label and tablespace_map removal
>> should be durable to avoid putting the system in a wrong state after
>> power loss. Other calls of unlink() are followed by durable_rename so
>> they are fine if let as such.
>> - Removal of old WAL segments should be durable as well. There is
>> already an effort to rename them durably in case of a segment
>> recycled. In case of a power loss, a file that should have been
>> removed could remain in pg_xlog.
>>
>> Looking around, I have bumped as well on the following bug report for
>> SQlite which is in the same category of things:
>> http://sqlite.1065341.n5.nabble.com/Potential-bug-in-crash-recovery-code-unlink-and-friends-are-not-synchronous-td68885.html
>> Scary to see that in this case durability can be a problem at
>> transaction commit...
>
> This patch applies cleanly and compiles at cccbdde.
>
> Ashutosh, do you know when you'll have a chance to review?

The scope of this work has expanded, since last time I reviewed and
marked it as RFC. Right now I am busy with partition-wise joins and do
not have sufficient time to take a look at the expanded scope.
However, I can come back to this after partition-wise join, but that
may stretch to the end of the commitfest. Sorry.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Craig Ringer 2017-03-17 02:34:05 Re: logical decoding of two-phase transactions
Previous Message Michael Paquier 2017-03-17 02:21:02 Re: Renaming of pg_xlog and pg_clog