Re: Potential data loss of 2PC files

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Potential data loss of 2PC files
Date: 2017-02-13 05:10:34
Message-ID: CAB7nPqTa75a6yqAONK7N1um5CaHy=X1X4=GArSbHbP1J4j9NKA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

On Tue, Jan 31, 2017 at 11:07 AM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Mon, Jan 30, 2017 at 10:52 PM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>> If that can happen, don't we have the same problem in many other places?
>> Like, all the SLRUs? They don't fsync the directory either.
>
> Right, pg_commit_ts and pg_clog enter in this category.

Implemented as attached.

>> Is unlink() guaranteed to be durable, without fsyncing the directory? If
>> not, then we need to fsync() the directory even if there are no files in it
>> at the moment, because some might've been removed earlier in the checkpoint
>> cycle.
>
> Hm... I am not an expert in file systems. At least on ext4 I can see
> that unlink() is atomic, but not durable. So if an unlink() is
> followed by a power failure, the previously unlinked file could be
> here if the parent directory is not fsync'd.

So I have been doing more work on this patch, with the following things done:
- Flush pg_clog, pg_commit_ts and pg_twophase at checkpoint phase to
ensure their durability.
- Create a durable_unlink() routine to give a way to perform a durable
file removal.
I am now counting 111 calls to unlink() in the backend code, and
looking at all of them most look fine with plain unlink() if they are
not made durable as they work on temporary files (see timeline.c for
example), with some exceptions:
- In pg_stop_backup, the old backup_label and tablespace_map removal
should be durable to avoid putting the system in a wrong state after
power loss. Other calls of unlink() are followed by durable_rename so
they are fine if let as such.
- Removal of old WAL segments should be durable as well. There is
already an effort to rename them durably in case of a segment
recycled. In case of a power loss, a file that should have been
removed could remain in pg_xlog.

Looking around, I have bumped as well on the following bug report for
SQlite which is in the same category of things:
http://sqlite.1065341.n5.nabble.com/Potential-bug-in-crash-recovery-code-unlink-and-friends-are-not-synchronous-td68885.html
Scary to see that in this case durability can be a problem at
transaction commit...
--
Michael

Attachment Content-Type Size
unlink-2pc-loss-v3.patch application/octet-stream 6.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2017-02-13 05:21:50 Re: Documentation improvements for partitioning
Previous Message Okano, Naoki 2017-02-13 04:12:50 Keep ECPG comment for log_min_duration_statement