Skip site navigation (1) Skip section navigation (2)

Re: Basic Q on superfluous primary keys

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: "Craig A(dot) James" <cjames(at)modgraph-usa(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org, Merlin Moncure <mmoncure(at)gmail(dot)com>, Kynn Jones <kynnjo(at)gmail(dot)com>
Subject: Re: Basic Q on superfluous primary keys
Date: 2007-04-18 19:08:44
Message-ID: 1176923324.4152.145.camel@dogma.v10.wvs (view raw or flat)
Thread:
Lists: pgsql-performance
On Tue, 2007-04-17 at 21:06 -0700, Craig A. James wrote:
> Merlin Moncure wrote:
> > In the context of this debate, I see this argument all the time, with
> > the implied suffix: 'If only we used integer keys we would not have
> > had this problem...'.  Either the customer identifies parts with a
> > part number or they don't...and if they do identify parts with a
> > number and recycle the numbers, you have a problem...period.
> 
> On the contrary.  You create a new record with the same part number.  You mark the old part number "obsolete".  Everything else (the part's description, and all the relationships that it's in, such as order history, catalog inclusion, revision history, etc.) is unaffected.  New orders are placed against the new part number's DB record; for safety the old part number can have a trigger that prevent new orders from being placed.
> 
> Since the part number is NOT the primary key, duplicate part numbers are not a problem.  If you had instead used the part number as the primary key, you'd be dead in the water.
> 
> You can argue that the customer is making a dumb decision by reusing catalog numbers, and I'd agree.  But they do it, and as database designers we have to handle it.  In my particular system, we aggregate information from several hundred companies, and this exact scenario happens frequently.  Since we're only aggregating information, we have no control over the data that these companies provide.  If we'd used catalog numbers for primary keys, we'd have big problems.
> 

Storing data is easy.

The difficulty lies in storing data in such a way that your assumptions
about the data remain valid and your queries still answer the questions
that you think they answer.

Because an internal ID field has no meaning outside of the database
(some auto-generated keys do have meaning outside the database, but I'm
not talking about those), you can't effectively query based on an
internal id any more than you can query by the ctid. So what do you
query by then? You query by natural keys anyway. Internal id fields are
an implementation detail related to performance (the real topic of this
discussion).

If you have two parts with the same part id, what does that mean? Sure,
you can store the data, but then the queries that assumed that data was
unique no longer hold. Sometimes you need two parts with the same part
id, but you have to know the meaning in order to query based on that
data.

Let me ask these questions:
 - Do you think that all of your relations have an internal id? 
 - Do you think that all the internal ids you use are unique in the
relations in which they appear?

If you answer "yes" to either question, consider that every query on
that data is also a relation and so are subselects and intermediate
results. Do those all have an id? If not, why not? How do you join a
virtual relation to a physical relation if the virtual relation has no
internal id? Is the id field still unique in the result of a join or
Cartesian product?

Regards,
	Jeff Davis



In response to

pgsql-performance by date

Next:From: Hannes DorbathDate: 2007-04-18 20:04:56
Subject: Re: Equivalents in PostgreSQL of MySQL's "ENGINE=MEMORY" "MAX_ROWS=1000"
Previous:From: Jim C. NasbyDate: 2007-04-18 18:25:53
Subject: Re: Long running transactions again ...

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group