Re: Report: race conditions in WAL replay routines

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Report: race conditions in WAL replay routines
Date: 2012-02-05 21:29:20
Message-ID: CA+U5nM+ETyC1tAwyEnXtsZxtCQs0GAma5HtmXy+snp1CKX0KOw@mail.gmail.com
Lists: pgsql-hackers

On Sun, Feb 5, 2012 at 9:03 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> * Not exactly a race condition, but: tblspc_redo does ereport(ERROR)
>>> if it fails to clean out tablespace directories.  This seems to me to be
>>> the height of folly, especially when the failure is more or less an
>>> expected case.  If the error occurs the database is dead in the water,
>>> because that error is actually a PANIC and will recur on subsequent
>>> restart attempts.  Therefore there is no way to recover short of manual
>>> intervention to clean out the non-empty directory.  And why are we
>>> pulling the fire alarm like this?  Well, uh, it's because we might fail
>>> to recover some disk space in the dropped tablespace.  Seems to me to be
>>> a lot better to just elog(LOG) and move on.  This is quite analogous to
>>> the case of failing to unlink a file after commit --- wasting disk space
>>> might be bad, but it's very much the lesser evil compared to this.
>
>> If the sysadmin is managing the db properly then this shouldn't ever
>> happen - the only cause is if the tablespace being dropped is being
>> used as a temp tablespace on the standby.
>
> Right, but that is an expected/foreseeable situation.  It should not
> lead to a dead-and-unrestartable database.
>
>> If you just LOG, when exactly would we get rid of the tablespace?
>
> The tablespace *is* gone, or at least its catalog entries are.  All we
> are trying to do here is release some underlying disk space.  It's
> exactly analogous to the case where we drop a table and then find (post
> commit) that unlinking the disk file fails for some weird reason.
> We've done what we can to clean the disk space and should just let it
> go --- there is no risk to database integrity in leaving some files
> behind, so killing the server is a huge overreaction.

I agree the tablespace entries are gone, but that won't stop existing
users from continuing.

If we're not sure why tablespace removal fails, it doesn't seem safe
to me to continue.

But since this is a rare corner case, and we already try to remove
users, LOG seems OK.
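
To make the proposal concrete, here is a minimal sketch of the control
flow being discussed: try to remove the directories, evict any remaining
users and retry, and if that still fails, log and carry on rather than
raise an error. This is not the actual tblspc_redo code. The function
replay_tablespace_drop, the two callback types, and the demo_* stubs are
hypothetical names invented for illustration, and fprintf to stderr
stands in for elog(LOG).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical callback types; these are not PostgreSQL APIs. */
typedef bool (*remove_dirs_fn)(const char *location); /* true if the directories were removed */
typedef void (*evict_users_fn)(const char *location); /* stand-in for cancelling sessions still using them */

typedef enum { DROP_REMOVED, DROP_LEFT_BEHIND } drop_outcome;

/*
 * Sketch of replaying a tablespace drop without ever raising an error:
 * remove, evict users and retry, then log and continue.
 */
drop_outcome
replay_tablespace_drop(const char *location,
                       remove_dirs_fn remove_dirs,
                       evict_users_fn evict_users)
{
    assert(location != NULL && remove_dirs != NULL && evict_users != NULL);

    /* First attempt: the common case, where nothing is using the directory. */
    if (remove_dirs(location))
        return DROP_REMOVED;

    /* Something is still in there; evict remaining users and retry once. */
    evict_users(location);
    if (remove_dirs(location))
        return DROP_REMOVED;

    /* Still failing: report it and move on; only disk space is lost. */
    fprintf(stderr,
            "LOG: could not remove directories for dropped tablespace "
            "at \"%s\"; files may remain\n", location);
    return DROP_LEFT_BEHIND;
}

/* Demo stubs modelling a directory's contents (assumptions for illustration). */
int temp_files;  /* files owned by sessions; they vanish once sessions are evicted */
int stray_files; /* files nobody owns; eviction cannot clear them */

bool
demo_remove_dirs(const char *location)
{
    (void) location;
    return temp_files + stray_files == 0; /* removal succeeds only when empty */
}

void
demo_evict_users(const char *location)
{
    (void) location;
    temp_files = 0;
}
```

With the stubs above, a directory holding only session temp files is
removed on the second attempt, while one holding a stray file produces a
LOG line and DROP_LEFT_BEHIND, and replay continues either way.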

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
