Re: pervasiveness of surrogate (also called synthetic) keys

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: pervasiveness of surrogate (also called synthetic) keys
Date: 2011-05-02 19:21:39
Message-ID: BANLkTikqs+GrGKbrfG+q_k0GpH4XRQH4MA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Mon, May 2, 2011 at 10:10 AM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> On 05/01/2011 06:12 PM, Karsten Hilbert wrote:
>>
>> Good to know since I'm only a lowly medical doctor not
>> having much schooling in database matters beyond this list,
>> the PostgreSQL docs, and the Celko book.
>>
>
> This debate exists at all levels of experience, and the only thing that
> changes as you get more experienced people involved is an increase in
> anecdotes on each side.  The sole time I ever found myself arguing with Joe
> Celko is over an article he wrote recommending natural keys, using an
> example from the automotive industry.  Problem was, the specific example he
> gave was flat out wrong.  I was working in automotive MIS at the time, and
> the thing he was saying would never change did, in fact, change every
> year--in only a fraction of a percent of cases, in an extremely subtle way
> that snuck up on people and wreaked much confusion.  That's typical for an
> early natural key design:  you get it working fine in V1.0, only to discover
> months or years down the road there's a case you never considered you don't
> model correctly, and it may take some sort of conversion to fix.
>
> The reason why there's a strong preference for surrogate keys is that they
> always work and you can avoid ever needing to come up with a better design.
> if you just use them and forget about it.  The position Merlin has advocated
> here, that there should always be a natural key available if you know the
> data well enough, may be true.  But few people are good enough designers to
> be sure they've made the decision correctly, and the downsides of being
> wrong can be a long, painful conversion process.  Easier for most people to
> just eliminate the possibility of making a mistake by using auto-generated
> surrogate keys, where the primary problem you'll run into is merely using
> more space/resources than you might otherwise need to have.  It minimizes
> the worst-case--mistake make in the model, expensive re-design--by adding
> overhead that makes the average case more expensive.  Software design
> usually has enough risks that any time you can eliminate one just by
> throwing some resources at it, that's normally the right thing to do.

There are many practical arguments advocating the use of surrogate
keys. Faster updates, easier schema changes, performance, maintenance
costs down the line, better tools integration, etc. These arguments
basically involve trade-offs that can be justifiably be used to push
you one way or the other. That's all well and good.

However, I tend to disagree with arguments that you are better off not
identifying natural keys at all. To my mind, any database that has a
table with no discernible key but whose records are referred to via
another table's foreign key has a schema that is in a State of Error.
A surrogate key is just added information to the database -- why does
it play that record X out of N identical candidates should be the one
mapped? Is that logic repeatable? What are the ramifications for
joins that do not flow through the ID columns? Typically what it means
is that the rules that guard against duplicate information entry are
not, in fact in the database at all but in the application, and bad
data can now get into your database by a much broader array of causes.
The last and best defense against a nasty and common class of data
errors has been removed. The more complex and your database is, the
more it tends to be used a by a large array of clients, possibly even
spanning multiple computer languages -- thus the need for a root
system of constraint checking that is declarative and easily
understood.

Sure, requirements change, models change, but at any particular point
and time a model with as little as possible (read: none) outside
inputs should be able to provably demonstrate verifiable facts. With
a natural key database (or a surrogate key database with defined keys
that are not used for RI) you have inherent constraint checking that a
purely surrogate database simply doesn't have. Whatever the software
maintenance costs are, which is itself a complex and debatable topic,
I'll go with a strategy that gives a better defense against bad or
ambiguous data.

merlin

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Adam Bruss 2011-05-02 20:16:11 Pass age function to extract function
Previous Message Jim Irrer 2011-05-02 18:20:59 Re: pervasiveness of surrogate (also called synthetic) keys