Skip site navigation (1) Skip section navigation (2)

SQL assistance, please...

From: Eric Walstad <eric(at)ericwalstad(dot)com>
To: sfpug(at)postgresql(dot)org
Subject: SQL assistance, please...
Date: 2006-02-15 03:59:43
Message-ID: 200602141959.43785.eric@ericwalstad.com (view raw or flat)
Thread:
Lists: sfpug
Hello everyone,

My addresses table includes a hash of the address, which I use to speed 
searches.  The searches are working well, but occasionally I find that there 
are hash collisions like the following:

select distinct addr1, address_hash from addresses where address_hash = 
'-1394669317' order by addr1;
         addr1         | address_hash
-----------------------+--------------
 123 SOME STREET       | -1394669317
 543 ANOTHER STREET    | -1394669317

That is, totally different addresses happen to result in the same hash value.  
I'd like to find out how often this is happening so I can see if I need to 
change hash algorithms.


My question is: how can I find all such instances like this in my table?


I'd like something like:
where a.addr1 != b.addr1 and a.address_hash = b.address_hash
but I'm not sure how to join in the table twice in such a way that I don't get 
a ton of duplicate records.  Also, I'm not sure what this pattern is called, 
so I don't know what to google for.

Any pointers are greatly apprceiated.  Many thanks in advance,

Eric.


additional info...

The table looks like this:
uv2=# \d addresses

     Column      |           Type           |   Modifiers
-----------------+--------------------------+----------------------------------------------------------------------
 id              | integer                  | not null default 
nextval('public.addresses_id_seq'::text)
 addr1           | character varying(255)   | not null
 addr2           | character varying(255)   |
 city            | character varying(255)   |
 state           | character varying(2)     | not null
 zipcode         | character varying(5)     |
 zip4            | character varying(4)     |
 address_hash    | character varying(128)   |


The address hash is generated from a concatenation of addr1 - zip4 fields 
(normalized: case insensitive, no punctuation, common replacements (street -> 
st), etc.).  Ideally, I'd like to find cases where any of the address fields 
differ but the address_hash is the same.

Responses

sfpug by date

Next:From: Jeff FrostDate: 2006-02-15 04:45:34
Subject: Re: SQL assistance, please...
Previous:From: Eric WalstadDate: 2006-02-14 21:31:42
Subject: Re: invalid byte sequence for encoding "UNICODE": 0xd9

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group