Using hashtext and unique constraints together

From: "Mason Hale" <masonhale(at)gmail(dot)com>
To: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Using hashtext and unique constraints together
Date: 2007-12-11 15:55:29
Message-ID: 8bca3aa10712110755o4f6d7c9m8349e7cd50914dd0@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

I recently discovered the hashtext() function, and I'm interested in using
it to reduce the size of one of the biggest indexes in my database.

I have a table of URLs with a unique index on the URL. Some of these URLs
are very long (we truncate them at 1024 characters), and we have many
millions of these in our database, and so the index is very, very big.

We've considered building an index using an MD5 hash, but the hashtext()
function seems better because it hashes to a 4-byte integer.

The problem with either MD5 or hashtext() is that neither can guarantee
unique output even if the inputs are all unique.

For SELECTs this is no problem, as I could build an index on hashtext(url),
and to write a query to find 'http://www.postgresql.org' would look
something like:

SELECT * FROM page WHERE hashtext(url) = hashtext('
http://www.postgresql.org') AND url = 'http://www.postgresql.org'

The hashtext(url) condition will hit the index, and in the rare event it
returns more than one row, the url = conditions will pick out the matching
one from those.

The problem I need help with is guaranteeing uniqueness of the URL on
insert, with a non-unique index on hashtext(url) and *without* creating a
unique index on page(url).

I'm thinking that an insert trigger that ensures (SELECT count(*) FROM page
WHERE hashtext(url) = hashtext('http://www.postgresql.org') AND url = '
http://www.postgresql.org' ) = 0 won't work given MVCC, as two transactions
could simultaneously insert the same url at the same time.

Can anyone think of a way to guarantee uniqueness of the URL without adding
a unique constraint on the page(url) column?

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Joshua D. Drake 2007-12-11 16:19:49 Re: Hijack!
Previous Message Gregory Stark 2007-12-11 15:44:04 Re: partitioned table query question