Re: Using hashtext and unique constraints together

From: "Mason Hale" <masonhale(at)gmail(dot)com>
To: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Using hashtext and unique constraints together
Date: 2007-12-12 00:00:16
Message-ID: 8bca3aa10712111600w70fe53bfg202ae2d55e7818cd@mail.gmail.com
Lists: pgsql-general

I recently discovered the hashtext() function, and I'm interested in using
> it to reduce the size of one of the biggest indexes in my database.
>
...

The problem with either MD5 or hashtext() is that neither can guarantee
> unique output even if the inputs are all unique.
>
...

>
> The problem I need help with is guaranteeing uniqueness of the URL on
> insert, with a non-unique index on hashtext(url) and *without* creating a
> unique index on page(url).
>
> I'm thinking that an insert trigger that ensures (SELECT count(*) FROM
> page WHERE hashtext(url) = hashtext('http://www.postgresql.org')
> AND url = 'http://www.postgresql.org' ) = 0 won't work given MVCC, as two
> transactions could simultaneously insert the same url at the same time.
>
> Can anyone think of a way to guarantee uniqueness of the URL without
> adding a unique constraint on the page(url) column?
>
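
To spell out the trigger idea from the quoted text above (a sketch only, with made-up function and trigger names): the check-then-insert pattern would look like the following, and under MVCC two concurrent transactions can each see no matching row for the same URL and both insert it.

CREATE OR REPLACE FUNCTION check_page_url_unique() RETURNS trigger AS $$
BEGIN
    -- Check-then-insert: not safe under concurrency, shown only to
    -- illustrate the approach being ruled out.
    IF EXISTS (SELECT 1 FROM page
               WHERE hashtext(url) = hashtext(NEW.url)
                 AND url = NEW.url) THEN
        RAISE EXCEPTION 'duplicate url: %', NEW.url;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER page_url_unique_check
    BEFORE INSERT ON page
    FOR EACH ROW EXECUTE PROCEDURE check_page_url_unique();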

I gather from the lack of response that the answer is "no"?

Sorry to respond to my own post... but I had an idea -- what if I hash the
same URL in two different ways, and create a unique compound index on both
hashes?

For example:

CREATE UNIQUE INDEX page_url_double_hash_idx ON page ( hashtext(url), hashtext(md5(url)) );

The idea here is that the odds of two different strings hashing to the same
value with both hashtext() *and* md5() are remote enough to rely on for
uniqueness. I'm converting the md5 to hashtext to save space.
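
To make that concrete, a minimal sketch (the page table is simplified here and the index name is made up):

CREATE TABLE page (
    id  serial PRIMARY KEY,
    url text NOT NULL
);

-- Two independent hashes of the same URL in one unique index.
-- hashtext() returns a 4-byte integer; md5() returns text, which is hashed
-- again so each entry is just two integers.
CREATE UNIQUE INDEX page_url_double_hash_idx
    ON page ( hashtext(url), hashtext(md5(url)) );

INSERT INTO page (url) VALUES ('http://www.postgresql.org');  -- succeeds
INSERT INTO page (url) VALUES ('http://www.postgresql.org');  -- fails with a duplicate key error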

or even

CREATE UNIQUE INDEX page_url_salted_hash_idx ON page ( hashtext(url), hashtext(url || 'SALT') );

This alternative approach assumes that two different strings that hash to the
same value will not also hash to the same value if they are concatenated with
some additional (constant) data. Again, this doesn't absolutely guarantee
uniqueness, but it seems like it should be safe enough. I guess it depends on
how truly random the hashtext() output is.

Both of the above would require 8 bytes for each index entry (plus
overhead), which is still a big savings over our current implementation.
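
Lookups would work the same as with the plain non-unique hashtext(url) index I
originally had in mind: filter on the leading hashtext(url) column (which
either compound index can satisfy) and recheck the actual url to weed out
collisions. Roughly:

SELECT *
  FROM page
 WHERE hashtext(url) = hashtext('http://www.postgresql.org')
   AND url = 'http://www.postgresql.org';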

Any thoughts on these ideas?

thanks in advance,
Mason
