Re: 8192 BLCKSZ ?

From: Nathan Myers <ncm(at)zembu(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 8192 BLCKSZ ?
Date: 2000-11-28 21:01:34
Message-ID: 20001128130134.C22345@store.zembu.com
Lists: pgsql-hackers

On Tue, Nov 28, 2000 at 12:38:37AM -0500, Tom Lane wrote:
> "Christopher Kings-Lynne" <chriskl(at)familyhealth(dot)com(dot)au> writes:
> > I don't believe it's a performance issue, I believe it's that writes to
> > blocks greater than 8k cannot be guaranteed 'atomic' by the operating
> > system. Hence, 32k blocks would break the transactions system.
>
> As Nathan remarks nearby, it's hard to tell how big a write can be
> assumed atomic, unless you have considerable knowledge of your OS and
> hardware.

Not to harp on the subject, but even if you _do_ know a great deal
about your OS and hardware, you _still_ can't assume any write is
atomic.

To give an idea of what is involved, consider that modern disk
drives routinely reorder writes on their own. You think you
have asked for a sequential write of 8K bytes, or 16 sectors of
512 bytes each, but the disk might write the first and last
sectors first, and then the middle sectors in random order. A
block of all zeroes might not be written at all, but just noted
in the track metadata.
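
To make the reordering concrete, here is a small simulation, not
a model of any real drive. It assumes 512-byte sectors (8K over
16 sectors), picks the write order at random, and cuts power
after a chosen number of sectors. The names torn_write and
sectors_done are made up for illustration.

```python
import random

SECTOR_SIZE = 512          # assumed: 8K bytes / 16 sectors
SECTORS_PER_BLOCK = 16
BLOCK_SIZE = SECTOR_SIZE * SECTORS_PER_BLOCK


def torn_write(old_block, new_block, sectors_done, rng):
    """Return what is on disk if power fails after `sectors_done`
    sectors have been written in a drive-chosen (here: random) order."""
    assert len(old_block) == len(new_block) == BLOCK_SIZE
    order = list(range(SECTORS_PER_BLOCK))
    rng.shuffle(order)     # the drive, not the caller, picks the order
    on_disk = bytearray(old_block)
    for sector in order[:sectors_done]:
        start = sector * SECTOR_SIZE
        on_disk[start:start + SECTOR_SIZE] = new_block[start:start + SECTOR_SIZE]
    return bytes(on_disk)


old = b"A" * BLOCK_SIZE
new = b"B" * BLOCK_SIZE
result = torn_write(old, new, sectors_done=5, rng=random.Random(0))

# One character per sector: which version of each sector survived.
print("".join(chr(result[i * SECTOR_SIZE]) for i in range(SECTORS_PER_BLOCK)))
```

The block left on disk is neither the old version nor the new
one, and nothing in its layout tells you which sectors made it.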

Most disks have a "feature": they report a write as complete
as soon as the data is in the drive's RAM cache, rather than
once the sectors are on the disk. (It's a "feature" because it
makes their benchmarks come out better.) It can usually be
turned off, but different vendors have different ways to do it.
Have you turned it off on your production drives?
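
For reference, this is roughly all an application can do from
its side: hand the data to the OS and ask for it to be pushed to
the device. It is a sketch only; write_durably and the scratch
file are hypothetical names. Whether os.fsync() also forces the
data out of the drive's own RAM cache depends on the OS, the
driver and the drive, which is exactly the problem described
above.

```python
import os
import tempfile


def write_durably(path, data):
    """Write data and ask the OS to push it to the device before returning."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # drain Python's userspace buffer to the OS
        os.fsync(f.fileno())   # ask the OS to flush its cache to the device


# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "block.bin")
write_durably(path, b"\0" * 8192)
```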

In the event of a power outage, the drive will stop writing in
mid-sector. If you're lucky, that sector will fail its checksum
the next time you try to read it. If the half-written sector
happens to contain track metadata, you might have a bigger problem.
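
The drive's own per-sector checksum is out of your hands, but
the same idea can be applied one level up. The sketch below is
an application-level page checksum, not the drive mechanism
described above: it appends a CRC-32 to an 8K page and shows
that a page made of half new and half old contents fails
verification. The page layout and the names seal_page and
page_is_intact are assumptions made for this example.

```python
import zlib

PAGE_SIZE = 8192
CRC_SIZE = 4


def seal_page(payload):
    """Append a CRC-32 of the payload so a reader can verify the page."""
    assert len(payload) == PAGE_SIZE - CRC_SIZE
    return payload + zlib.crc32(payload).to_bytes(CRC_SIZE, "big")


def page_is_intact(page):
    """Recompute the CRC-32 of the payload and compare with the stored one."""
    payload, stored = page[:-CRC_SIZE], page[-CRC_SIZE:]
    return zlib.crc32(payload).to_bytes(CRC_SIZE, "big") == stored


old_page = seal_page(b"A" * (PAGE_SIZE - CRC_SIZE))
new_page = seal_page(b"B" * (PAGE_SIZE - CRC_SIZE))

# Torn write: the first half of the new page landed, the rest is still old.
torn_page = new_page[:PAGE_SIZE // 2] + old_page[PAGE_SIZE // 2:]

print(page_is_intact(new_page), page_is_intact(torn_page))
```

A checksum like this only tells you the page is damaged. It does
nothing to get the old or the new contents back.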

----
The short summary is: for power outage or OS-crash recovery purposes,
there is no such thing as atomicity. This is why backups and
transaction logs are important.
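
As a rough illustration of what a transaction log buys you, the
sketch below logs a full page image before the page is written
in place, then repairs a torn page by replaying the log. It is a
toy under stated assumptions, not how any particular database
does it: the record format, log_record and recover are invented
here, and it takes for granted that the log record itself
reached the disk intact.

```python
import zlib


def log_record(page_no, image):
    """Frame a full page image with its page number and a CRC-32."""
    body = page_no.to_bytes(4, "big") + image
    return body + zlib.crc32(body).to_bytes(4, "big")


def recover(data_pages, log):
    """Replay every intact log record over the data pages, in log order.
    A record with a bad CRC is treated as the torn end of the log."""
    for record in log:
        body, stored = record[:-4], record[-4:]
        if zlib.crc32(body).to_bytes(4, "big") != stored:
            break
        data_pages[int.from_bytes(body[:4], "big")] = body[4:]
    return data_pages


# The log record was written first; the in-place page write was then torn.
new_image = b"B" * 8192
log = [log_record(7, new_image)]
data_pages = {7: b"B" * 4096 + b"A" * 4096}

recover(data_pages, log)
print(data_pages[7] == new_image)
```

Recovery never needs the in-place write to have been atomic. It
only needs the log to be readable up to some point.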

"Invest in a UPS." Use a reliable OS, and operate it in a way that
doesn't stress it. Even a well-built OS will behave oddly when
resources are badly stressed. (That the oddities may be documented
doesn't really help much.)

For performance purposes, it may be more or less efficient to group
writes into 4K, 8K, or 32K chunks. That's not a matter of database
atomicity, but of I/O optimization. It can only confuse people to
use "atomicity" in that context.

Nathan Myers
ncm(at)zembu(dot)com
