Re: Potential data loss of 2PC files

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Potential data loss of 2PC files
Date: 2016-12-30 05:52:53
Message-ID: CAB7nPqQEPV1m+G_TRhsW3-0qde4YakVto40OJ+-eK=SgpwFUGA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Dec 29, 2016 at 6:41 PM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> I agree with this.
> If no prepared transactions were required to be fsynced
> CheckPointTwoPhase(), do we want to still fsync the directory?
> Probably not.
>
> May be you want to call fsync_fname(TWOPHASE_DIR, true); if
> serialized_xacts > 0.

Definitely true for the non-recovery code path. But not for restart
points as there is no GXACT entry created by the redo routine of 2PC
prepare. We could have a static counter tracking how many 2PC files
have been flushed since last restart point or not but I am not
convinced if that's worth the facility.

> The comment just states what has been done earlier in the function,
> which is documented in the prologue as well.
> /*
> * Make sure that the content created is persistent on disk to prevent
> * data loss in case of an OS crash or a power failure. Each 2PC file
> * has been already created and flushed to disk individually by
> * RecreateTwoPhaseFile() using the list of GXACTs available for
> * normal processing as well as at recovery when replaying individually
> * each XLOG_XACT_PREPARE record.
> */
> Instead, you may want to just say "If we have flushed any 2PC files,
> flush the metadata in the pg_twophase directory."

That's not true for recovery. So I could go for something like that:
"If any 2PC files have been flushed, do the same for the parent
directory to make this information durable on disk. On recovery, issue
the fsync() anyway, individual 2PC files have already been flushed whe
replaying their respective XLOG_XACT_PREPARE record.

> Although, I thought that a simple case of creating a persistent table
> which requires creating a file also would need to flush the directory.
> I tried to locate the corresponding code to see if it also uses
> fsync_fname(). I couldn't locate the code. I have looked at
> mdcreate(), mdpreckpt(), mdpostckpt(). But may be that's not relevant
> here.

I have been playing with unlogged tables created, then followed by
checkpoints, and a VM plugged off but I could not see those files
getting away. That was on ext4, but like ext3 things can disappear if
the fsync() is forgotten on the parent directory.
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fabien COELHO 2016-12-30 06:42:05 Re: proposal: session server side variables
Previous Message Regina Obe 2016-12-30 05:51:56 What is "index returned tuples in wrong order" for recheck supposed to guard against?