Re: Determining logically unique entities across many partially complete rows where at least one column matches

From: Jasen Betts <jasen(at)xnet(dot)co(dot)nz>
To: pgsql-sql(at)postgresql(dot)org
Subject: Re: Determining logically unique entities across many partially complete rows where at least one column matches
Date: 2009-08-11 11:45:17
Message-ID: h5rlkd$e88$1@reversiblemaps.ath.cx
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

On 2009-08-11, Jamie Tufnell <diesql(at)googlemail(dot)com> wrote:
> Hi,
>
> I am faced with a modeling problem and thought I'd see if anyone has run
> into something similar and can offer some advice.
>
> Basically my problem domain is cataloguing "snippets of information" about
> "entities" which are loosely identified.
>
> Entities can be identified up to 3 different methods (email, phone or
> openid.)
>
> Entities can have zero or many emails, phone numbers and openids. The
> only restriction is they must have at least one value in one of those three
> columns.
>
>
> Some sample data:
>
> snippet #1
> email: null
> phone: +1234567890
> openid: john(at)myopenid(dot)net
> information: This is snippet #1
>
> snippet #2
> email: foo(at)bar(dot)com
> phone: null
> openid: johnny.name
> information: This is snippet #2
>
> At this point snippet #1 and #2 could refer to different entities.
>
> snippet #3
> email: bar(at)baz(dot)com
> phone: +1234567890
> openid: johnny.name
> information: This is snippet #3
>
> But now all three snippets definitely refer to the same entity, as far as
> we're concerned:
>
> Entity: 1
> OpenIDs: johnny.name, john(at)myopenid(dot)net
> Phones: +1234567890
> Emails: foo(at)bar(dot)com, bar(at)baz(dot)com
>
> So as far as modeling this goes, I'm stuck between:
>
> 1. Normalizing as usual with some serious triggers to maintain the
> relationships.
> 2. Just having a snippets table with these fields inline and make these
> inferences at query time.
> 3. Something in between.
> 4. Using a document store like CouchDB.

I think three tables openid,email, phone
(phone's a bad one for a unique id IME, especially POTS lines)

create table (entid integer, phone text unique)

etc,etc...

then at insert time you use a rule that runs a function
with exception handling when the unique rules detect a match with some
existing data and then in the exception code you do updates to replace
the higher entid with the lower one

> The kinds of queries I need to do right now (which will no doubt change):
>
> * Return all snippets.
> * Return all distinct entities.
> * Find all id for a distinct entity given a single piece of id.
> * Find all snippets for a distinct entity.
>
> To do it in one table, I am thinking something like this:
>
> create table snippets (
> id serial not null primary key,
> email text,
> phone_number text,
> openid text,
> information text not null,
> check (email is not null or
> phone_number is not null or openid is not null)
> );
>
> with queries like:
>
> * Find all snippets for one distinct entity, searching by openid:
>
> select * from snippets
> where phone_number =
> (select phone_number from snippets where openid = 'john(at)myopenid(dot)net')
> or email =
> (select email from snippets where openid = 'john(at)myopenid(dot)net')
> or openid in
> (select openid from snippets
> where phone_number =
> (select phone_number from snippets where openid = 'john(at)myopenid(dot)net')
> or email =
> (select email from snippets where openid = 'john(at)myopenid(dot)net'));

but that won't get them all.

> Or if I was to model as usual I am thinking something like this:
>
> create table entities (
> id serial not null primary key
> );
>
> create table entity_has_email (
> entity_id integer not null
> references entities (id) on delete cascade on update cascade,
> email text not null unique
> );
>
> create table entity_has_phone_number (
> entity_id integer not null
> references entities (id) on delete cascade on update cascade,
> phone_number text not null unique
> );
>
> create table entity_has_openid (
> entity_id integer not null
> references entities (id) on delete cascade on update cascade,
> openid text not null unique
> );
>
> create table snippets (
> id serial not null primary key,
> entity_id integer not null
> references entities (id) on delete cascade on update cascade,
> information text not null
> );
>
> (followed by a mass of on insert/update/delete triggers)

>
> select s.* from snippets s
> join entity_has_email e on s.entity_id = e.id
> join entity_has_phone_number p on s.entity_id = p.id
> join entity_has_openid o on s.entity_id = o.id
> where o.openid = 'john(at)myopenid(dot)net';

looks like the wrong query for snippets to me.

select s.* from snippets s
join entity_has_openid o on s.entity_id = o.id
where o.openid = 'john(at)myopenid(dot)net';

> Another option, sort of half way between the two could be:
>
> create table snippets (
> id serial not null primary key,
> entity_id integer not null
> references entities (id) on delete cascade on update cascade,
> information text not null
> );
>
> create table entities (
> id serial not null primary key,
> email text,
> phone_number text,
> openid text,
> check (email is not null or
> phone_number is not null or openid is not null)
> );

that's not going to work with your example data.
(subject has two different email addresses)

I guess you could use arrays for email, openid, and phone.

In response to

Browse pgsql-sql by date

  From Date Subject
Next Message Jan Verheyden 2009-08-11 13:01:31 mail alert
Previous Message Pavel Stehule 2009-08-11 10:55:05 Re: Updating a specific number of rows in pl/pgsql