SQL assistance, please...

From: Eric Walstad <eric(at)ericwalstad(dot)com>
To: sfpug(at)postgresql(dot)org
Subject: SQL assistance, please...
Date: 2006-02-15 03:59:43
Message-ID: 200602141959.43785.eric@ericwalstad.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: sfpug

Hello everyone,

My addresses table includes a hash of the address, which I use to speed
searches. The searches are working well, but occasionally I find that there
are hash collisions like the following:

select distinct addr1, address_hash from addresses where address_hash =
'-1394669317' order by addr1;
addr1 | address_hash
-----------------------+--------------
123 SOME STREET | -1394669317
543 ANOTHER STREET | -1394669317

That is, totally different addresses happen to result in the same hash value.
I'd like to find out how often this is happening so I can see if I need to
change hash algorithms.

My question is: how can I find all such instances like this in my table?

I'd like something like:
where a.addr1 != b.addr1 and a.address_hash = b.address_hash
but I'm not sure how to join in the table twice in such a way that I don't get
a ton of duplicate records. Also, I'm not sure what this pattern is called,
so I don't know what to google for.

Any pointers are greatly apprceiated. Many thanks in advance,

Eric.

additional info...

The table looks like this:
uv2=# \d addresses

Column | Type | Modifiers
-----------------+--------------------------+----------------------------------------------------------------------
id | integer | not null default
nextval('public.addresses_id_seq'::text)
addr1 | character varying(255) | not null
addr2 | character varying(255) |
city | character varying(255) |
state | character varying(2) | not null
zipcode | character varying(5) |
zip4 | character varying(4) |
address_hash | character varying(128) |

The address hash is generated from a concatenation of addr1 - zip4 fields
(normalized: case insensitive, no punctuation, common replacements (street ->
st), etc.). Ideally, I'd like to find cases where any of the address fields
differ but the address_hash is the same.

Responses

Browse sfpug by date

  From Date Subject
Next Message Jeff Frost 2006-02-15 04:45:34 Re: SQL assistance, please...
Previous Message Eric Walstad 2006-02-14 21:31:42 Re: invalid byte sequence for encoding "UNICODE": 0xd9