Re: cyclical redundancy checksum algorithm(s)?

From: Jonathan Leffler <jleffler(at)earthlink(dot)net>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: cyclical redundancy checksum algorithm(s)?
Date: 2006-09-28 03:53:49
Message-ID: hHHSg.13580$v%4.500@newsread1.news.pas.earthlink.net
Lists: pgsql-general

Karen Hill wrote:
> Tom Lane wrote:
>> "Karen Hill" <karen_hill22(at)yahoo(dot)com> writes:
>>> Ralph Kimball states that this is a way to check for changes. You just
>>> have an extra column for the crc checksum. When you go to update data,
>>> generate a crc checksum and compare it to the one in the crc column.
>>> If they are same, your data has not changed.
>> You sure that's actually what he said? A change in CRC proves the data
>> changed, but lack of a change does not prove it didn't.
>
>
> On page 100 in the book, "The Data Warehouse Toolkit" Second Edition,
> Ralph Kimball writes the following:
>
> "Rather than checking each field to see if something has changed, we
> instead compute a checksum for the entire row all at once. A cyclic
> redundancy checksum (CRC) algorithm helps us quickly recognize that a
> wide messy row has changed without looking at each of its constituent
> fields."
>
> On page 360 he writes:
>
> "To quickly determine if rows have changed, we rely on a cyclic
> redundancy checksum (CRC) algorithm. If the CRC is identical for the
> extracted record and the most recent row in the master table, then we
> ignore the extracted record. We don't need to check every column to be
> certain that the two rows match exactly."
>
>> People do sometimes use this logic in connection with much wider
>> "summary" functions, such as an MD5 hash. I wouldn't trust it at all
>> with a 32-bit CRC, and not much with a 64-bit CRC. Too much risk of
>> collision.

An MD5 hash value has 128 bits; an SHA1 hash value has 160 bits.
Roughly speaking, a 32-bit checksum gives you a 1 in 4 billion chance
that a change won't be detected; a 64-bit checksum gives you a 1 in 16
billion billion chance; and a 128-bit checksum, therefore, a 1 in 256
billion billion billion billion chance. Every hash is subject to
collisions - there are indefinitely many possible input values (of
widely differing lengths) that produce the same output. However,
actually encountering two such values is very improbable.
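
For what it's worth, here's a minimal sketch of the technique Kimball
describes, done in PostgreSQL with the built-in md5() function rather
than a 32-bit CRC. The table and column names are made up for
illustration, and real data would need NULL handling and a
delimiter-safe concatenation:

    -- Hypothetical dimension table carrying a stored row checksum.
    CREATE TABLE customer_dim (
        customer_id  integer PRIMARY KEY,
        name         text NOT NULL,
        address      text NOT NULL,
        row_checksum text NOT NULL   -- md5 of the descriptive columns
    );

    -- Store the checksum alongside the data...
    INSERT INTO customer_dim (customer_id, name, address, row_checksum)
    VALUES (1, 'Alice', '1 Main St', md5('Alice' || '|' || '1 Main St'));

    -- ...and later, only rewrite the row when the incoming record
    -- hashes to a different value than the one already stored.
    UPDATE customer_dim
       SET name = 'Alice',
           address = '2 Oak Ave',
           row_checksum = md5('Alice' || '|' || '2 Oak Ave')
     WHERE customer_id = 1
       AND row_checksum <> md5('Alice' || '|' || '2 Oak Ave');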

And don't forget: if you update a row so that only a few bytes change
(and most do not), the chance that the new row produces the same
checksum as the old row is still very small with a well-designed
checksum. Most updates produce small changes in the data but big
changes in the checksum.
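
You can see that avalanche behaviour by hashing two inputs that differ
in a single character; the two md5 values come out completely
unrelated (the strings here are arbitrary):

    SELECT md5('customer row, version 1') AS old_checksum,
           md5('customer row, version 2') AS new_checksum;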

The other issue is how long it takes to compute the checksum compared
with doing a field-by-field check. There are many facets to that
answer, related to how the old values are cached and how the comparison
with the new values is done.
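
For comparison, a sketch of the two approaches side by side, against a
hypothetical staging table of freshly extracted records (customer_id,
name, address), reusing the customer_dim table from above; which one
wins depends on row width, caching, and how wide and messy the rows
really are:

    -- Field-by-field: only rewrite rows where some column really differs.
    UPDATE customer_dim d
       SET name = s.name, address = s.address
      FROM staging s
     WHERE d.customer_id = s.customer_id
       AND (d.name IS DISTINCT FROM s.name
            OR d.address IS DISTINCT FROM s.address);

    -- Checksum: compare one stored value against one computed value.
    UPDATE customer_dim d
       SET name = s.name,
           address = s.address,
           row_checksum = md5(s.name || '|' || s.address)
      FROM staging s
     WHERE d.customer_id = s.customer_id
       AND d.row_checksum <> md5(s.name || '|' || s.address);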

--
Jonathan Leffler #include <disclaimer.h>
Email: jleffler(at)earthlink(dot)net, jleffler(at)us(dot)ibm(dot)com
Guardian of DBD::Informix v2005.02 -- http://dbi.perl.org/
