CRCs (was: beta testing version)

From: ncm(at)zembu(dot)com (Nathan Myers)
To: pgsql-hackers(at)postgresql(dot)org
Subject: CRCs (was: beta testing version)
Date: 2000-12-06 19:08:00
Message-ID: 20001206110800.Q30335@store.zembu.com
Lists: pgsql-general pgsql-hackers

On Wed, Dec 06, 2000 at 11:49:10AM -0600, Bruce Guenter wrote:
> On Wed, Dec 06, 2000 at 11:15:26AM -0500, Tom Lane wrote:
> > Zeugswetter Andreas SB <ZeugswetterA(at)Wien(dot)Spardat(dot)at> writes:
> > > Yes, but there would need to be a way to verify the last page or
> > > record from txlog when running on crap hardware.
> >
> > How exactly *do* we determine where the end of the valid log data is,
> > anyway?
>
> I don't know how pgsql does it, but the only safe way I know of is to
> include an "end" marker after each record. When writing to the log,
> append the records after the last end marker, ending with another end
> marker, and fdatasync the log. Then overwrite the previous end marker
> to indicate it's not the end of the log any more and fdatasync again.
>
> To ensure that it is written atomically, the end marker must not cross a
> hardware sector boundary (typically 512 bytes). This can be trivially
> guaranteed by making the marker a single byte.

An "end" marker is not sufficient unless all writes are done in
one-sector units with an fsync between them and the drive's write
buffering is turned off. For larger writes the OS will re-order the
sectors, and most drives will re-order them too, even if the OS doesn't.
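
For reference, the two-step end-marker protocol quoted above might look
roughly like the sketch below. It is not taken from the PostgreSQL
sources; the marker values and the append_record helper are hypothetical,
and os.fsync stands in for fdatasync. Note that nothing in it controls
the order in which the sectors of a multi-sector payload reach the
platter, which is the weakness described above.

```python
import os

END = b"\x01"      # hypothetical marker value: "the log ends here"
NOT_END = b"\x00"  # hypothetical marker value: "more records follow"

def append_record(fd, end_pos, payload):
    """Append payload after the end marker at end_pos; return the new end position."""
    # Step 1: write the record followed by a fresh end marker, then sync.
    os.pwrite(fd, payload + END, end_pos + 1)
    os.fsync(fd)
    # Step 2: overwrite the old one-byte marker so it no longer says "end", then sync.
    os.pwrite(fd, NOT_END, end_pos)
    os.fsync(fd)
    return end_pos + 1 + len(payload)
```

The log file is assumed to start out containing a single END byte at
offset 0, so the first call is append_record(fd, 0, payload).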

> Any other way I've seen discussed (here and elsewhere) either
> - Requires atomic multi-sector writes, which are possible only if all
> the sectors are sequential on disk, the kernel issues one large write
> for all of them, and you don't powerfail in the middle of the write.
> - Assume that a CRC is a guarantee.

We are already assuming a CRC is a guarantee.

The drive computes a CRC for each sector, and if the CRC is OK the
drive is happy. CRC errors within the drive are quite frequent, and
the drive re-reads when a bad CRC comes up. (If it sees errors too
frequently on a sector, it rewrites it; if it sees persistent errors
on a sector, it marks that one bad and relocates it.) You can expect
to experience, in production, about the error rate that the drive
manufacturer specifies as "maximum".

> ... A CRC would be a good addition to
> help ensure the data wasn't broken by flakey drive firmware, but
> doesn't guarantee consistency.

No, a CRC would be a good addition to compensate for sector write
reordering, which is done both by the OS and by the drive, even for
"atomic" writes.

It is not only "flaky" or "cheap" drives that re-order writes, or
acknowledge writes as complete that are not yet on disk. You
can generally assume that *any* drive does it unless you have
specifically turned that off. The assumption is that if you care,
you have a UPS, or at least have configured the hardware yourself
to meet your needs.

It is purely wishful thinking to believe otherwise.
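
As a minimal sketch of how a per-record CRC can compensate for reordered
sector writes, recovery can scan the log from the start and stop at the
first record whose checksum does not match. The record layout (length
and CRC-32 header) and both function names below are assumptions made
for illustration, not PostgreSQL's actual log format.

```python
import struct
import zlib

# Hypothetical record header: payload length, then CRC-32 of the payload.
HEADER = struct.Struct("<II")

def encode_record(payload):
    """Frame a payload as header + payload."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload

def valid_prefix_length(log):
    """Return the byte offset just past the last record whose CRC checks out."""
    pos = 0
    while pos + HEADER.size <= len(log):
        length, crc = HEADER.unpack_from(log, pos)
        start = pos + HEADER.size
        payload = log[start:start + length]
        # A zero length is rejected so that zero-filled, never-written space
        # is not mistaken for a run of valid empty records.
        if length == 0 or len(payload) < length or zlib.crc32(payload) != crc:
            break  # torn or unwritten record: treat this as the end of the log
        pos = start + length
    return pos
```

A record whose later sectors never reached the disk fails the check even
if its first sector did, so the scan ends at the last fully written
record. A 32-bit CRC can still miss a corrupted record by chance, so
this is a strong check rather than an absolute guarantee.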

Nathan Myers
ncm(at)zembu(dot)com
