Re: SQL assistance, please...

From: Jeff Frost <jeff(at)frostconsultingllc(dot)com>
To: Eric Walstad <eric(at)ericwalstad(dot)com>
Cc: sfpug(at)postgresql(dot)org
Subject: Re: SQL assistance, please...
Date: 2006-02-15 04:45:34
Message-ID: Pine.LNX.4.64.0602142042090.4513@discord.dyndns.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: sfpug

Eric,

You probably want to run a query with a self join like this:

SELECT a1.addr1, a1.address_hash FROM address a1, address a2 WHERE
a1.address_hash = a2.address_hash AND
a1.addr1 != a2.addr1;

You should probably also consider placing a unique constraint on address_hash
so you'll get an error when attempting to insert a duplicate.

-Jeff

On Tue, 14 Feb 2006, Eric Walstad wrote:

> Hello everyone,
>
> My addresses table includes a hash of the address, which I use to speed
> searches. The searches are working well, but occasionally I find that there
> are hash collisions like the following:
>
> select distinct addr1, address_hash from addresses where address_hash =
> '-1394669317' order by addr1;
> addr1 | address_hash
> -----------------------+--------------
> 123 SOME STREET | -1394669317
> 543 ANOTHER STREET | -1394669317
>
> That is, totally different addresses happen to result in the same hash value.
> I'd like to find out how often this is happening so I can see if I need to
> change hash algorithms.
>
>
> My question is: how can I find all such instances like this in my table?
>
>
> I'd like something like:
> where a.addr1 != b.addr1 and a.address_hash = b.address_hash
> but I'm not sure how to join in the table twice in such a way that I don't get
> a ton of duplicate records. Also, I'm not sure what this pattern is called,
> so I don't know what to google for.
>
> Any pointers are greatly apprceiated. Many thanks in advance,
>
> Eric.
>
>
> additional info...
>
> The table looks like this:
> uv2=# \d addresses
>
> Column | Type | Modifiers
> -----------------+--------------------------+----------------------------------------------------------------------
> id | integer | not null default
> nextval('public.addresses_id_seq'::text)
> addr1 | character varying(255) | not null
> addr2 | character varying(255) |
> city | character varying(255) |
> state | character varying(2) | not null
> zipcode | character varying(5) |
> zip4 | character varying(4) |
> address_hash | character varying(128) |
>
>
> The address hash is generated from a concatenation of addr1 - zip4 fields
> (normalized: case insensitive, no punctuation, common replacements (street ->
> st), etc.). Ideally, I'd like to find cases where any of the address fields
> differ but the address_hash is the same.
>
>

--
Jeff Frost, Owner <jeff(at)frostconsultingllc(dot)com>
Frost Consulting, LLC http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954

In response to

Responses

Browse sfpug by date

  From Date Subject
Next Message Eric Walstad 2006-02-15 06:34:25 Re: SQL assistance, please...
Previous Message Eric Walstad 2006-02-15 03:59:43 SQL assistance, please...