Re: surrogate key or not?

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: sad <sad(at)bankir(dot)ru>, <pgsql-sql(at)postgresql(dot)org>
Subject: Re: surrogate key or not?
Date: 2004-07-21 17:25:33
Message-ID: 200407211025.33615.josh@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

Sad,

First of all, please excuse me if I've misunderstood you below because of
translation issues. You'll find I'm rather strident, but it's because the
reasons you're presenting, or seem to be, are excuses for bad database design
I hear every day on the job, and end up having to fix when they go wrong.

Now, to reply to your comments:

> 4) Replication: to identify an object of ANY type (record of any table
> regardless to datamodel), to store lists of deleted or modified objects
> (regardless to datamodel)

Only if the Replication product requires it. Note that our new replication
engine, Slony-I, does *not* require surrogate keys; it can handle
multi-column primary keys. I'm not sure about the other replication
products, they may be able to as well.

I think what you're talking about is a GUID (Global Unique Identifier). Once
again, while most GUIDs are based on random number theory, the ideal GUID
would be a unique combination of the Row Key, the version, and the server
identification -- making it "real data" instead of a random number that tells
you nothing about the row. For performance, this combination might be
combined in a hash.

In other words, I think your 4th point is actually part of Point (1),
convenience.

> 5) Making a primary key: if there is no real key at all.

No, no, no, a thousand times NO.

Data without a real key is NOT DATA. It is just garbage.

I manage or consult on more than twenty production databases for my clients.
Every single one of those databases has a real unique key on every single
table. The ONLY exception is "holding tables" for incoming bulk data, some
of which will be retained as data and some of which will discarded as garbage
-- and key uniqueness is the primary test of what is and isn't data.

> the sentence (5) is debatable. in theory every relation has a real key, but
> in the practice we have historical datamodels without PK !!! it is
> impossible but it exists.

Only because you have bad historical databases. This is not an excuse, it is
a reason to fix them.

> For example:
> Here in Russia we have a lot of different but identically named streets
> within one city. They has absoluetly identical attributes. Historically
> only human not machines work on that datamodel and they all used to call
> such streets by name adding some spechial non-formal explainations, for
> example:
> "deliver this message please to the house 35 on the Green street, that is
> to the west of the center of the city."
> "deliver this message please to the house 12 on the Green street, that is
> shortest of the all Green streets in the town."

This is a PERFECT example of the evil of surrogate keys. You had a data
problem, and instead of solving it, you chose to lean on surrogate keys like
a crutch. Here's what you have in your database now:

Streets
ID Street Name
345 Green Street
2019 Green Street
5781 Green Street
Key: ID???

How, exactly, do you expect to distinguish the above 3 "green streets" from
each other? How do you expect to your users to know that 345 is West Green
Street and 5781 is the shortest in town? How are they supposed to choose
between the three? How do you know there are actually 3 and maybe not just
two or even one? The ID 5781 isn't exactly informative; in fact, it's
meaningless. It's GARBAGE.

Look, real data, on a row-by-row basis, is a POSTULATE. It is should
describe adequately the portion of the world you are trying to represent.
Automated surrogate keys are NOT part of these postulates since they convey
no meaningful information. This is what you currently have:

There is a street named GREEN STREET.
There is a street named GREEN STREET.
There is a street named GREEN STREET.

As you can see by rendering it in English, those three statements are damned
confusing. Are there three streets named Green Street, or is there only
one? Either could be true. We don't know, the data doesn't tell us.

What you need is:

Streets
ID Street Name Location
345 Green Street West Side of City
2019 Green Street In Front of Consulate
5781 Green Street Shortest in Town
Key: Street Name, Location

This gives you much more meaningful data:

There is a street named GREEN STREET which is on the WEST SIDE OF THE CITY.
There is a street named GREEN STREET which is IN FRONT OF THE CONSULATE.
There is a street named GREEN STREET which is the SHORTEST IN TOWN.

This tells the user which Green Street he wants to select. It also gives us
a fair indication that there are 3 Green Streets, and not one or two. And if
there is duplication -- for example, if the street in front of the consulate
is also the shortest in town -- then we can determine this and correct it in
a minute or less by looking at a map.

I'm not just picking on you. I'm saying this because it's happened to me and
was a costly error for both me and the client.

Several years ago, we were doing a scheduling database, and the client messed
around with the data model a lot. In the process, we mistakenly eliminated
the Events table's (the largest and most important table in the database)
real unique key. But because we were "leaning" on a surrogate key, we
didn't notice until the database was in production.

Then, after 6 months of scheduling, the client began to suspect that duplicate
Events were creeping into the database. But without a real, unique key, we
discovered that we could not figure out whether or not two events with
similar information were duplicates or not! The Event_ID told us nothing.
This was a real, critical problem because each Event required the attendance
of an attorney and there were never enough attorneys to go around.

--
Josh Berkus
Aglio Database Solutions
San Francisco

In response to

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Gregory S. Williamson 2004-07-21 19:34:16 Re: Sorry too many conecctions
Previous Message Geoff Richards 2004-07-21 15:28:19 Re: MySQL-style "create temporary table foo select ..."