Re: Hash Function: MD5 or other?

From: Peter Fein <pfein(at)pobox(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Hash Function: MD5 or other?
Date: 2005-06-14 13:33:34
Message-ID: 42AEDCAE.2070608@pobox.com

Alex Stapleton wrote:
>
> On 13 Jun 2005, at 23:49, Peter Fein wrote:
>
>
>> Hi-
>>
>> I wanted to use a partially unique index (dependent on a flag) on a TEXT
>> column, but the index row size was too big for btrees. See the thread
>> "index row size 2728 exceeds btree maximum, 2713" from the beginning of
>> this month for someone with a similar problem. In it, someone suggests
>> indexing on a hash of the text. I'm fine with this, as the texts in
>> question are similar enough to each other to make collisions unlikely
>> and a collision won't really cause any serious problems.
>>
>> My question is: is the builtin MD5 appropriate for this use or should I
>> be using a function from pl/something? Figures on collision rates would
>> be nice as well - the typical chunk of text is probably 1k-8k.
>>
>> Thanks!
>>
>
> As others have said, MD5 isn't the fastest hash out there, but no
> cryptographically secure hash really is. You could probably get away
> with a CRC long enough to reduce your chances of collision a lot.
> However, PostgreSQL doesn't have a built-in CRC function, which is a
> bit of a pain unless you're prepared to implement one or use pl/* to
> do it, which sounds like overkill. I suggest you run some benchmarks
> on MD5 and see if it's fast enough to meet your current (and perhaps
> future) needs.
>
> You could, of course, just use a hash index on your text field! I think
> that would probably cope with larger data sets OK. It has the advantage
> of handling collisions for you as well :) However, it means you have to
> transfer large amounts of data around, so if network speed ever becomes
> a limitation, MD5 hashing (or enabling compression on your PgSQL
> connection) may help.

Let me clarify my problem:

I've got a bunch of texts, stored in column sometext. These texts are
assigned to groups, group_id. Each group may contain rows with identical
sometext. Finding the appropriate group at insert time is very expensive
and is done on the client side. I want to mark *one* of each set of
exact duplicates in a group with a flag, group_representative.
Basically, a partial index on:

(group_id, sometext) WHERE group_representative=TRUE.
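Concretely, the hash-based version of that index might look like this
(a sketch only -- I'm assuming the table is called mytable, and I
haven't tested it):

```sql
-- Index the 32-character MD5 digest instead of the raw text, which
-- keeps each index entry far below the btree row-size limit:
CREATE UNIQUE INDEX mytable_group_rep_idx
    ON mytable (group_id, md5(sometext))
    WHERE group_representative = TRUE;
```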

The motivation is then to be able to do:

SELECT * FROM mytable WHERE <some other conditions> AND
group_representative=TRUE

instead of:

SELECT DISTINCT ON (group_id, sometext) * FROM mytable
WHERE <some other conditions>

since there are more of these queries than inserts (and they return a
lot more rows than belong to a single group_id). I suspect that running
a DISTINCT in this case would be more expensive than updating the index
when inserting.

Knowing the specifics of the data going into sometext, a halfway decent
hash function would make collisions vanishingly rare (and a collision
wouldn't break anything anyway). Is this approach reasonable, or should
I use a hash index on (group_id, sometext) -- does that suffer from the
same row-size limitation as btrees? I thought hash indexes were slow...
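For what it's worth, the standard birthday bound backs up the
"vanishingly rare" claim: with n distinct texts and a 128-bit digest,
P(collision) is roughly n(n-1)/2^129. Even at ten million texts that
works out to around 1.5e-25:

```sql
-- Birthday-bound estimate of an MD5 collision among 10 million texts:
SELECT (1e7 * (1e7 - 1)) / (2 * pow(2::float8, 128))
    AS approx_collision_prob;
-- on the order of 1.5e-25, i.e. negligible
```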

Hope that's clearer...

--Pete

--
Peter Fein pfein(at)pobox(dot)com 773-575-0694

Basically, if you're not a utopianist, you're a schmuck. -J. Feldman
