Re: pervasiveness of surrogate (also called synthetic) keys

From: David Johnston <polobo(at)yahoo(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>, Rob Sargent <robjsargent(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: pervasiveness of surrogate (also called synthetic) keys
Date: 2011-05-04 04:00:06
Message-ID: B6BA5655-A7DF-4208-B3FA-5F40B5519C7F@yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On May 3, 2011, at 22:03, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:

> Merlin Moncure wrote:
>> If your data modeler that made the the assumptions that a MAC is
>> unique (a mistake obviously) at least the other tables are protected
>> from violations of that assumption because the database would reject
>> them with an error, which is a Good Thing. Without a uniqueness
>> constraint you now have ambiguous data which is a Very Bad Thing.
>>
>
> With a uniqueness constraint in this situation, the unexpected data--row with a non unique MAC--will be rejected and possibly lost when the insertion happens. You say that's a good thing, plenty of people will say that's the worst possible thing that can happen. When dealing with external data, it's often impossible to know everything you're going to see later at design time. Approaching that problem with the idea that you're going to lose any data that doesn't fit into the original model is not what everyone finds reasonable behavior.
>
> I don't think it's possible to decide in a generic way which of these is the better approach: to reject unexpected data and force the problem back at the application immediately (commit failure), or to accept with with because you're using a surrogate key and discover the problems down the line. Both are valid approaches with a very different type of risk associated with them. I think it's fair to say that real-world data is not always well known enough at design time to follow the idea you're suggesting though, and that does factor into why there is such a preference for surrogate keys in the industry.
>
> --
> Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
> PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
> "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general

If you implicitly expect MAC to be unique but do not enforce it then you are likely to write queries that rely upon that uniqueness. If you do enforce it then the constraint failure occurs anyway.

A scalar sub-query that links via the MAC will fail when the duplicate data is encountered, and normal queries will return too-many records. A detail report may be obvious but if you are summarizing the data the specific offending record is going to require some effort to find.

I guess if you are the compromising type you can loosely enforce the uniqueness by running a check query periodically to see if supposedly unique values have been duplicated.

I agree there is no right answer - the designer needs to make trade-offs - but I'd rather reject new data and leave the system in a status-quo stable state instead of introducing invalid data and putting the system into a state where it requires effort to get it functioning again. If you accept the invalid data the likely scenario, if something breaks, is someone finds the offending record and removes it until the application and database can be fixed properly - which is where we are at with validation. The common exception is where identifiers are reused over time and you remove the old record in order to keep/allow the newer record to remain.

On a tangential course I've started considering is a setup whereby you basically have two identifiers for a record. One is end-user facing and updatable whereas the other is static and used in intra-table relations. You can create a new record with the same user-facing id as an existing Id but the existing Id will be replaced with its system id. This is useful when users will be using the Id often and it can be reasonably assumed to be unique over a moderate period of time (say a year). Invoice numbers, customer numbers are two common examples. The lookup Id itself may require additional fields in order to qualify as a primary (natural) key but the static key wants to be a single field. Often simply putting a date with the original id (and parent identifiers) is sufficient due to the infrequency of updates. The downside is, with string-based parent identifiers the pk value can be quite long. I currently have PKs of 40-50 length but during my new design my first pass on a couple of tables indicated >100 characters limit.

Is there any rules-of-thumb on the performance of a PK as a function of key length? I like using varchar based identifiers since I tend to query tables directly and writing where clauses is much easier if you can avoid the joins. I'm likely better off creating views and querying those but am still curious on any basic thoughts on having a 100+ length primary key.

David J.

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Tian Luo 2011-05-04 04:17:02 "full_page_writes" makes no difference?
Previous Message Greg Smith 2011-05-04 02:12:25 Re: pervasiveness of surrogate (also called synthetic) keys