Re: COMMIT NOWAIT Performance Option

From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COMMIT NOWAIT Performance Option
Date: 2007-02-27 10:49:32
Message-ID: 1172573373.3760.645.camel@silverbirch.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, 2007-02-26 at 23:04 -0500, Tom Lane wrote:
> "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> > COMMIT NOWAIT can co-exist with the normal form of COMMIT and does not
> > threaten the consistency or robustness of other COMMIT modes. Read that
> > again and think about it, before we go further, please.
>
> I read that, and thought about it, and don't think I believe it.

I don't mind saying its taken a while to believe it myself. And I very
much want to have the concept tested so we all believe it. Doubt=>risk.

> The
> problem is that there are more interconnections between different
> transactions than you're allowing for. In particular you need to
> justify that the behavior is safe for non-transactional operations like
> btree page splits and pg_clog buffer page writes. The idea that's
> particularly bothering me at the moment is that after a system crash,
> we might come back up in a state where a NOWAIT transaction appears
> committed when its updates didn't all get to disk. "Database corrupt"
> is a situation that threatens all your transactions...

OK, I can see its possible for the following WAL history to occur:

t1 btree split part1
t2 COMMIT;
<---------- risk of inconsistent WAL
t1 btree split part2

The COMMIT by t2 flushes WAL, so if a crash happens at the point shown
we have an incomplete multi-part WAL operation. At this stage, lets
assume that no data writes by either t1 or t2 have made it to disk,
which is normal and likely.

On recovery, we apply all the changes in WAL up to the end of WAL. At
the last minute t2 sneaks in and gets committed. t1 was never committed,
so we never found out whether it would be a COMMIT NOWAIT or a COMMIT.
*Whichever* it will be the btree split is still incomplete and recovery
understands this and acts accordingly.

So the situation that sounds nasty is actually a normal situation
currently, so presents no threat because we already handle this
correctly. Multi-part operations seem good to me.

> The idea that's
> particularly bothering me at the moment is that after a system crash,
> we might come back up in a state where a NOWAIT transaction appears
> committed when its updates didn't all get to disk. "Database corrupt"
> is a situation that threatens all your transactions...

Well, I've spent 2 hours thinking this through, gone pale, thought hard
and gone down a few blind alleys along the way.

The bottom line is that COMMIT NOWAIT isn't fully safe, and thats part
of the explicit non-full guarantee, written in big letters on the tin.
If a transaction commits and then we crash before we flush WAL, then the
transaction will be incomplete. As we define COMMIT now, this is broken
and I use that word accurately: If you use COMMIT NOWAIT, you risk data
loss *but* you have the choice of which transactions this applies to.
However, WAL *is* consistent and there is no danger of database
corruption. Simply put, this is not group commit.

So if you issue COMMIT NOWAIT and then crash, the transactions that were
marked committed need to be marked aborted in clog. So during recovery
we will need to keep track of which transactions are in progress, so we
can mark them explicitly aborted, rather than the current implicit
mechanism. This would need to be done carefully, since a clog page that
was created at transaction start may never have made it to disk by the
time of the crash. We must extend clog in that case, even if the
extension WAL never made it to WAL, either, so that we can record the
aborted state of the Xids that exist somewhere on disk.

Keeping track of transactions in progress won't take long. It isn't
required at all during archive recovery, but that might form the basis
for later use as a Snapshot creation mechanism for read-only access
during startup.

It isn't possible for a normal non-readonly transaction to change data
that has been changed by a commit-nowait transaction, and then have the
normal transaction commit, yet without also implicitly committing the
commit-nowait transaction.

Hopefully, I've got that right?

> > New commit mode is available by explicit command, or as a default
> > setting that will be applied to all COMMITs, or both.
>
> I dislike introducing new nonstandard syntax ("Oracle compatible" is not
> standard). If we did this I'd vote for control via a GUC setting only;
> I think that is more useful anyway, as an application can be made to run
> with such a setting without invasive source code changes.

OK.

Having read through all of the above things again, ISTM that we should
make this functionality available by a new GUC commit_fsync_delay, which
must be set explicitly > 0 before this feature can be used at all. If I
confused Tom by using commit_delay, then I'll confuse others also and
group commit and deferred fsync are different techniques with different
robustness guarantees. When enabled it should have a clear message in
the log to show that some commits might be using commit_nowait.

I'd even welcome a more descriptive term that summed up the relaxed
transaction guarantee implied by the use of the deferred fsync
technique. Perhaps even a very explicit USERSET GUC:

transaction_guarantee = on (default) | off

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2007-02-27 11:03:12 Re:
Previous Message Dave Page 2007-02-27 10:13:12 Re: Seeking Google SoC Mentors