Re: RelationCreateStorage can orphan files

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: RelationCreateStorage can orphan files
Date: 2010-09-16 02:15:16
Message-ID: AANLkTimHVpYtugbu=1UhxyiEHnqRQ5Dg8fGgEwGRu17f@mail.gmail.com
Lists: pgsql-hackers

On Wed, Sep 15, 2010 at 9:16 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> I notice that RelationCreateStorage() creates the main fork on disk
>> before writing (let alone flushing) WAL.  So if PG gets killed at that
>> point, we end up with an orphaned file on disk.  I think that we could
>> even extend the relation a few times before WAL gets written, so I
>> don't even think it's necessarily a zero-size file.  We could perhaps
>> avoid this by writing and flushing a WAL record that includes the
>> creating XID before touching the disk; when we replay the record, we
>> create the file but then delete it if the XID fails to commit before
>> recovery ends.  But I guess maybe our feeling is that it's just not
>> worth taking a performance hit for this?
>
> That design is intentional.  If the file create fails, and you've
> already written a WAL record that says you created it, you are flat
> out screwed.  You can't even PANIC --- if you do, then the replay of
> the WAL record will likely fail and PANIC again, leaving the database
> dead in the water.

This is probably of no more than academic interest, but could you get
around the problem by having replay of the XLOG record defer creating
the file until it is actually written to or the creating XID commits?
And, if the XID does not commit, go back and try to remove the file on
a best-effort basis?
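
To make the idea concrete, here is a rough toy model of the replay side
in Python. The record shapes, the replay() function, and the file names
are all made up for illustration and are not PostgreSQL's actual WAL
format or recovery code. It also says nothing about what to do if the
deferred creation itself fails.

```python
import os
import tempfile


def replay(records, data_dir):
    """Replay toy WAL records, deferring file creation until it is needed.

    Hypothetical record shapes (assumptions, not real WAL records):
      ("create", xid, relfile)       creation is only remembered
      ("write", xid, relfile, data)  forces the file to exist
      ("commit", xid)                forces pending files of xid to exist
    Returns the sorted list of files removed at end of recovery.
    """
    pending = {}       # relfile -> creating xid, not yet on disk
    created = {}       # relfile -> creating xid, materialized during replay
    committed = set()  # xids seen to commit

    def materialize(relfile):
        xid = pending.pop(relfile)
        open(os.path.join(data_dir, relfile), "ab").close()
        created[relfile] = xid

    for rec in records:
        kind = rec[0]
        if kind == "create":
            _, xid, relfile = rec
            pending[relfile] = xid
        elif kind == "write":
            _, xid, relfile, data = rec
            if relfile in pending:
                materialize(relfile)
            with open(os.path.join(data_dir, relfile), "ab") as f:
                f.write(data)
        elif kind == "commit":
            _, xid = rec
            committed.add(xid)
            for relfile in [r for r, x in pending.items() if x == xid]:
                materialize(relfile)

    # End of recovery: best-effort removal of files whose creator never
    # committed. Failures are ignored; the file is merely orphaned.
    removed = []
    for relfile, xid in created.items():
        if xid not in committed:
            try:
                os.remove(os.path.join(data_dir, relfile))
                removed.append(relfile)
            except OSError:
                pass
    return sorted(removed)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        log = [("create", 1, "rel_a"), ("commit", 1),
               ("create", 2, "rel_b"), ("write", 2, "rel_b", b"x")]
        print(replay(log, d), sorted(os.listdir(d)))
```

A file whose creating XID never writes to it and never commits is
simply never created in this model, so only the written-but-uncommitted
case needs the cleanup pass.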

> Orphaned files, in contrast, are completely non-dangerous --- the worst
> they can do is waste a little bit of disk space.  That's a cheap price
> to pay for not having an unrecoverable database after a create failure.
>
> This is essentially the same reason why CREATE DATABASE and related
> commands xlog directory copy operations only after completing them.
> That potentially wastes much more than a few blocks; but it's still
> non-dangerous, and far safer than the alternative.
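
For the archives, the ordering argument can be sketched as a toy model
in Python. Everything here (attempt(), replay(), the single "relfile",
the "PANIC" string) is a made-up simplification for illustration, not
how the backend is actually structured.

```python
def attempt(order, create_fails=False, crash_between=False):
    """Try to create one relation file; return (wal, disk) afterwards.

    order is "create_first" (file, then WAL record) or "log_first"
    (WAL record, then file). Both names are hypothetical labels.
    """
    wal, disk = [], set()

    def create():
        if create_fails:
            raise OSError("simulated file create failure")
        disk.add("relfile")

    def log():
        wal.append(("create", "relfile"))

    steps = [create, log] if order == "create_first" else [log, create]
    try:
        steps[0]()
        if crash_between:
            return wal, disk  # killed between the two steps
        steps[1]()
    except OSError:
        pass  # the operation errors out; anything already in WAL stays
    return wal, disk


def replay(wal, create_fails=False):
    """Replay must redo every logged creation; a failure here is fatal."""
    for _record in wal:
        if create_fails:
            return "PANIC"
    return "ok"
```

With "create_first", a crash between the steps leaves a file on disk
and nothing in WAL, which is the harmless orphan. With "log_first", a
failed create leaves a WAL record that replay cannot honor as long as
the create keeps failing.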

Thanks for the explanation.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
