Re: surrogate key or not?

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Karsten Hilbert <Karsten(dot)Hilbert(at)gmx(dot)net>, pgsql-sql(at)postgresql(dot)org
Subject: Re: surrogate key or not?
Date: 2004-07-22 17:05:59
Message-ID: 200407221005.59482.josh@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

Kasten,

> I have until now used surrogate primary keys on all table like
> so:

<snip>

> Short question: Is this OK re your concerns for using
> surrogates, eg. using a surrogate but making sure that at any
> one time there *would* be a real primary key candidate ?

Yes, this is an example of that. You have a long text field as part of the
key, and that would kill you performance-wise if diagnosis was referred to in
other tables and joined in views.

Keep in mind, though, that if a table is an "end node"; that is, if its PK is
not used as an FK by any other table, then worries about the performance of
keys and size of indexes are unfounded. In fact, for such tables, the
surrogate key is a performance drag; it adds a column and an index which are
not needed.

Now, addressing your table, I would have concerns other than the use of
primary keys. I suggest humbly that your data model/business logic may
need some development:

create table diagnosis (
pk serial primary key,
fk_patient integer
not null
references patient(pk)
on update cascade
on delete cascade,
narrative text
not null,
unique(fk_patient, narrative)
);

This was obviously created so that a patient could have multiple diagnoses.
However, there is no information in the table to indicate *why* there are
multiple diagnoses. And you are using a real key based on a long text
field; always hazardous, as there are many ways to phrase the same
information and duplication is likely. To do it in english, your postulates
look like:

PATIENT 67 was given a diagnosis of WATER ON THE KNEE.
PATIENT 456 was given a diagnosis of ACUTE HYPOCHONDRIA.

But this is a bit sketchy. Who made these diagnoses? When did they make
them? Why? This table could carry a *lot* more information, and should (sql
is shorthand)

create table diagnosis (
pk serial primary key,
fk_patient integer references patient(pk),
fk_visit integer references visits(pk),
fk_complaint integer references complaints(pk)
fk_staff integer references medical_staff(pk)
narrative text,
unique(fk_patient, fk_visit, fk_complaint, fk_staff)
);

Then your postulates become *much* more informative:

PATIENT 67 was given a diagnosis by STAFF MEMBER 12 on his VISIT #3
in response to NOT BEING ABLE TO WALK of WATER ON THE KNEE
PATIENT 456 was given a diagnosis by STAFF MEMBER 19 on his VISIT #192
in response to THE CREEPY-CRAWLIES of ACUTE HYPOCHONDRIA

It also allows you to establish a much more useful key; it's reasonable to
expect that a single staff member on one visit in response to one complaint
would only give one diagnosis. Otherwise, you have more than database
problems. And it prevents you from having to rely on a flaky long text key.

--
Josh Berkus
Aglio Database Solutions
San Francisco

In response to

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Tom Lane 2004-07-22 17:10:38 Re: LIKE on index not working
Previous Message Gerardo Castillo 2004-07-22 16:49:32 Problem with transaction in functions and tempory tables