Re: COMMIT NOWAIT Performance Option

From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Josh Berkus" <josh(at)agliodbs(dot)com>
Cc: "Jeff Davis" <pgsql(at)j-davis(dot)com>, <pgsql-hackers(at)postgresql(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject: Re: COMMIT NOWAIT Performance Option
Date: 2007-02-28 02:13:11
Message-ID: 87k5y3owqw.fsf@stark.xeocode.com
Lists: pgsql-hackers

"Josh Berkus" <josh(at)agliodbs(dot)com> writes:

> It's a question of whether your HW+OS can guarantee no torn page writes for
> the xlog.

No, the data files.

Torn pages in the xlog are also a problem, but we protect ourselves with a CRC
and stop replay if the CRC doesn't match. So the cost there is a bit of
CPU, not extra I/O.
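
To make the CRC point concrete, here is a minimal sketch of the idea: each
log record carries a checksum, and replay stops at the first record whose
checksum doesn't match. The record framing (length, CRC, payload) and the
names encode_record and replay are assumptions made up for illustration, and
zlib's CRC-32 stands in for whatever the real xlog code uses.

```python
import struct
import zlib

HEADER = struct.Struct("<II")  # hypothetical framing: payload length, CRC-32


def encode_record(payload):
    """Frame a record as header (length, CRC of payload) followed by payload."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def replay(log):
    """Return the payloads of all records before the first bad or short one."""
    records = []
    offset = 0
    while offset + HEADER.size <= len(log):
        length, crc = HEADER.unpack_from(log, offset)
        start = offset + HEADER.size
        payload = log[start:start + length]
        if len(payload) != length or zlib.crc32(payload) != crc:
            break  # torn or truncated record: stop replay here
        records.append(payload)
        offset = start + length
    return records
```

The only cost on the read side is recomputing the checksum over each record,
which is the "bit of CPU" mentioned above.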

> Running on Sun hardware combined with Solaris 10 with the xlog mounted
> forcedirectio, the Solaris folks are convinced we are torn-page-proof and so
> far we haven't been able to prove them wrong. And, on Solaris it's a
> substantial performance gain (like, 8-10% on OLTP benchmarks).

I would expect you to need a small non-volatile cache, either in the
controller or the drive itself, to be torn-page-proof. Or, failing that, to
have drives that operate on 8kB sectors and guarantee that whole sectors get
written using residual power. I don't think any drives operate in 8kB sectors
though.

The scary thing about torn pages with full_page_writes off is that we don't
offer any way to detect them. If both halves of the 8kB page look reasonable
you could conceivably end up continuing without ever knowing your data is
corrupt.

That could happen if, for example, the change that was being written is
small. Perhaps all that's missing is an update chain pointer. So you could
have two versions of the same record, with the chain pointer missing from the
old record. That would eventually lead to having two visible versions of the
same record but no crashes or other red flags.
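
A toy simulation of that scenario, to show why nothing flags it. The page
layout here is invented purely for illustration (it is not the real heap page
format): byte 0 plays the role of the old record's chain pointer and the
first byte of the second half marks the presence of the new version. The
helper names are hypothetical too.

```python
PAGE_SIZE = 8192
HALF = PAGE_SIZE // 2

# Toy layout (hypothetical, not PostgreSQL's heap format):
#   byte 0    : old record's chain pointer, 0 means "no newer version"
#   byte HALF : 1 if a new version of the record is stored in the second half
OLD_CHAIN_PTR = 0
NEW_VERSION_FLAG = HALF


def apply_update(page):
    """Return the page as it should look after the update is fully written."""
    new_page = bytearray(page)
    new_page[OLD_CHAIN_PTR] = 1     # old record now points at the new version
    new_page[NEW_VERSION_FLAG] = 1  # new version is present
    return bytes(new_page)


def torn_write(old_page, new_page):
    """Simulate a crash after only the second half of new_page reached disk."""
    return old_page[:HALF] + new_page[HALF:]


def visible_versions(page):
    """Count versions that look current to a reader of this toy page."""
    count = 0
    if page[OLD_CHAIN_PTR] == 0:
        count += 1  # old record has no chain pointer, so it looks current
    if page[NEW_VERSION_FLAG] == 1:
        count += 1  # new version is present
    return count
```

With an all-zero starting page, visible_versions reports one version before
the update and one after a complete write, but two after torn_write. Both
halves are individually well-formed, so nothing about the page itself looks
wrong.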

I suggested a while back implementing torn page detection by writing a
sequence number every 512 bytes in the blocks. (I was talking about WAL at
the time but the same principle applies.) Do it at the smgr layer using
readv/writev and the upper layers need never know their data wasn't contiguous
on disk. The only effect would be to shorten page sizes by 16 bytes, which
would be annoying but much less so than full_page_writes.
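
A rough sketch of the scheme, under some assumptions of mine: the stamp is a
single byte at the end of each 512-byte sector (which is what makes the
overhead 16 bytes on an 8kB page), and it is incremented on every write of
the page. The sketch interleaves the stamps with plain in-memory slicing;
with readv/writev the same scatter/gather could be done in the I/O call
itself. The names to_disk, from_disk and TornPageError are hypothetical.

```python
SECTOR = 512
PAGE = 8192
SECTORS = PAGE // SECTOR               # 16 sectors per page
DATA_PER_SECTOR = SECTOR - 1           # 511 data bytes + 1 stamp byte
LOGICAL_PAGE = SECTORS * DATA_PER_SECTOR  # 8176 bytes seen by upper layers


class TornPageError(Exception):
    """Raised when the sectors of a page carry different sequence stamps."""


def to_disk(logical, seq):
    """Interleave a one-byte sequence stamp at the end of every sector."""
    if len(logical) != LOGICAL_PAGE:
        raise ValueError("logical page must be %d bytes" % LOGICAL_PAGE)
    stamp = bytes([seq % 256])
    out = bytearray()
    for i in range(SECTORS):
        out += logical[i * DATA_PER_SECTOR:(i + 1) * DATA_PER_SECTOR]
        out += stamp
    return bytes(out)


def from_disk(physical):
    """Check that all stamps agree, then strip them and return the data."""
    if len(physical) != PAGE:
        raise ValueError("physical page must be %d bytes" % PAGE)
    stamps = {physical[i * SECTOR + DATA_PER_SECTOR] for i in range(SECTORS)}
    if len(stamps) != 1:
        raise TornPageError("sector stamps disagree: %s" % sorted(stamps))
    return b"".join(
        physical[i * SECTOR:i * SECTOR + DATA_PER_SECTOR]
        for i in range(SECTORS)
    )
```

A page written with seq 8 over a copy written with seq 7 reads back fine if
all 16 sectors made it, and raises TornPageError if only some of them did. A
one-byte stamp wraps after 256 writes, but a torn page only ever mixes two
consecutive versions, so the wrap doesn't hide anything in this sketch.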

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
