Re: Hash Function: MD5 or other?

From: Alex Stapleton <alexs(at)advfn(dot)com>
To: Peter Fein <pfein(at)pobox(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Hash Function: MD5 or other?
Date: 2005-06-14 13:55:45
Message-ID: B5ED3DF0-D4FC-496F-9918-8866C231A244@advfn.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general


On 14 Jun 2005, at 14:33, Peter Fein wrote:

> Alex Stapleton wrote:
>
>>
>> On 13 Jun 2005, at 23:49, Peter Fein wrote:
>>
>>
>>
>>> Hi-
>>>
>>> I wanted to use a partially unique index (dependent on a flag) on
>>> a TEXT
>>> column, but the index row size was too big for btrees. See the
>>> thread
>>> "index row size 2728 exceeds btree maximum, 2713" from the
>>> beginning of
>>> this month for someone with a similar problem. In it, someone
>>> suggests
>>> indexing on a hash of the text. I'm fine with this, as the texts in
>>> question are similar enough to each other to make collisions
>>> unlikely
>>> and a collision won't really cause any serious problems.
>>>
>>> My question is: is the builtin MD5 appropriate for this use or
>>> should I
>>> be using a function from pl/something? Figures on collision
>>> rates would
>>> be nice as well - the typical chunk of text is probably 1k-8k.
>>>
>>> Thanks!
>>>
>>>
>>
>> As others have said MD5 isn't the fastest one out there. However no
>> cryptographically secure hashes are really that fast. However
>> you can
>> probably get away with using a CRC hash which is long enough to
>> reduce
>> your chances of collision a lot. However, PostgreSQL doesn't have a
>> built in CRC function, which is a bit of a pain unless your
>> prepared to
>> implement one, or use pl/* to do it, which sounds like overkill. I
>> suggest you run some benchmarks on MD5 and see if it's fast
>> enough to
>> meet your current (and perhaps future) needs.
>>
>> You could of course, just use a hash index on your text field! I
>> think
>> that would probably cope with larger data sets OK. It has the
>> advantage
>> of handling collisions for you as well :) However it means you
>> have to
>> transfer large amounts of data around, so if network speed ever
>> becomes
>> a limitation, MD5 hashing (or enabling compression on your PgSQL
>> connection) may help.
>>
>
> Let me clarify my problem:
>
> I've got a bunch of texts, stored in column sometext. These texts are
> assigned to groups, group_id. Each group may contain rows with
> identical
> sometext. Finding the appropriate group when inserting is very
> expensive
> and is calculated on the client side. I want to mark *one* of each of
> the exact duplicates in a group with a flag, group_representative.
> Basically, a partial index on:
>
> (group_id, sometext) WHERE group_representative=TRUE.
>
> The motivation is then to be able to do:
>
> SELECT * FROM mytable WHERE <some other conditions> AND
> group_representative=TRUE
>
> instead of:
>
> SELECT * FROM mytable WHERE <some other conditions> DISTINCT ON
> (group_id, sometext)
>
> since there are more of these queries than inserts (and they return a
> lot more rows than belong to a single group_id). I suspect that
> running
> a DISTINCT in this case would be more expensive than updating the
> index
> when inserting.
>
> Knowing the specifics of the data I'm putting in sometext, a halfway
> decent hash function would make collisions so rare as to make the
> chance
> insignificant (and collisions wouldn't break anything anyway). Is this
> approach reasonable, or should I use a hash index on (group_id,
> sometext) - does this suffer from the same size limitation as
> btrees? I
> thought hash indexes were slow...
>
> Hope that's clearer...

From the manual.

Testing has shown PostgreSQL's hash indexes to perform no better than
B-tree indexes, and the index size and build time for hash indexes is
much worse. For these reasons, hash index use is presently discouraged.

So yes, they are slower, but not by an enormous amount. (The lack of
any data to backup the manuals conclusions worries me, but
nevermind.) It is certainly worth trying.

The only hashing function PG has built in which you can actually
access is MD5. You might be able to get better performance from that
if data transmission takes a long time as you will only need to send
the hash to the database. You could then build a B-Tree index on the
MD5 hashes, which would certainly do what you want it to as the
likelihood of a collision is low. So basically it all boils down to
just how slow MD5 is, and just how fast your network connection to
your DB server(s) is.

> --Pete
>
> --
> Peter Fein pfein(at)pobox(dot)com
> 773-575-0694
>
> Basically, if you're not a utopianist, you're a schmuck. -J. Feldman
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
>
>

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Shelby Cain 2005-06-14 14:15:43 Re: Hash Function: MD5 or other?
Previous Message Peter Fein 2005-06-14 13:33:34 Re: Hash Function: MD5 or other?