Referential Integrity Tutorial & Hacking Referential Integrity:


Referential integrity is the feature of a database that ensures that
implied relationships in a database are enforced. It is a feature of
most database systems, and protects users from accidentally (or
intentially!) creating discrepencies in their database.

In this chapter, we will work with a sample set of data involving
people, tasks they are responsible for, and appointments you have
scheduled with them. As you fire employees, you delete them from your
person table, but want to be certain that you can't leave tasks that
no one is in charge of, or leave appointments where the persons name
cannot be located.

(If your not familiar with the concepts of Primary Keys and Foreign
Keys, it might be helpful to review a basic database primer, as this
tutorial assumes basic knowledge of these concepts.)


First, because we're going playing with the system tables, it's best to do
this in a practice database:

# CREATE DATABASE RI_test;
# \c RI_test

Now, let's create practice tables. We'll have one parent table, p, and
two child tables, c1 and c2.

The parent table is straightforward:

# CREATE TABLE Pers (pid INT NOT NULL PRIMARY KEY, 
                     pname TEXT NOT NULL);

The first child table, which will hold tasks for which this person is
responsible. 

# CREATE TABLE Tasks (taskid SERIAL NOT NULL PRIMARY KEY,
                      pid INT NOT NULL
                        CONSTRAINT tasks__ref_p
                          REFERENCES Pers,
		      task TEXT NOT NULL);

A few notes:

* foreign key references are handled by PostgreSQL as a type of
CONSTRAINT. Therefore, we can begin the referential declaration with
CONSTRAINT <name of constraint>. If this is done, the error messages
generated by PostgreSQL will return this error name (which the user
could helpfully see, or which we could parse and handle in a front-end
system.) By wrapping it in system identifiers ("), we could even make
it look like an error message (eg CONSTRAINT "Table tasks refers to table
Pers"), but we are limited in the length of this name/message, and
therefore it's usefulness. CONSTRAINT names are not neccessary, and we
could leave this off: ...pid INT NOT NULL REFERENCES pers..., in which
case everything works the same, except our constraint is unnamed, and
therefore the error messages are more generic.

* after REFERENCES comes the name of the table we are referring to,
and (optionally), the name of the field in parentheses. If the field
is the primary key, the fieldname is optional. If we had not made pid
table Pers's primary key, we would have to say REFERENCES Pers(pid).

* optionally, we can inform PostgreSQL how to handle deletes/updates
on table Pers by adding ON DELETE and ON UPDATE declarations. If these
are not made specific, then PostgreSQL defaults to "no action", which
(at the time of this writing, 12/2000) means the same thing as
"restrict"

[Ed: my digging into the source code shows that they're treated the
same, though there is a comment about SQL3 requiring that it be
trapped before... I'm not sure what this is referring to, exactly, but
it still seems true that the mean the same. Is this accurate?]

, which means that we could neither delete a parent nor
update the parent pid field if a child existed that depended on that
parent. In our other child table, we'll see another way to handle
this.

* optionally, we could tell PostgreSQL if this is "deferrable", that
is, if we are allowed to violate referential integrity while inside of
some transaction as long as we have fixed our integrity problem at the
end of the transaction. We'll see examples of this below, in
DEFERRING.

So, let's add some sample data:

INSERT INTO Pers VALUES (1, 'Jeff Brown');
INSERT INTO Pers VALUES (2, 'Maria Lane');

INSERT INTO Tasks (pid, task) VALUES (1, 'Write contract');
INSERT INTO Tasks (pid, task) VALUES (1, 'Upgrade database');

So far so good.

If we try to insert a child that has no parent, eg

INSERT INTO Tasks (pid, task) VALUES (3, 'Install Linux');

we'll get a referential integrity error. This error will refer to our
constraint name (if any), and will block the entering of this data
into Tasks. (In addition, if this is part of a transaction, it will
rollback the entire transaction, as always happens in PostgreSQL).

Also, if we try to change a parent that has children:

DELETE FROM Pers WHERE pname = 'Jeff Brown';

We'll get the same problem, as we cannot delete Jeff as long
as he has his two tasks assigned.

DELETE FROM Pers WHERE pname = 'Maria Lane';

(if you actually deleted Maria, add her back in for our later
examples.)

because there are currently no child tasks for Maria.

Default (NO ACTION) (or RESTRICT, which means the same thing to
PostgreSQL) ON UPDATE and ON DELETE rules make sense for many
situations like this. You wouldn't want to be able to delete a staff
person from your database if they had certain resposibilities,
otherwise you would never know who was in charge of different tasks.


A More Permissive Child Table: Appointments

Let's create a second child, Appts (for Appointments):

# CREATE TABLE Appts (apptid SERIAL NOT NULL PRIMARY KEY,
                      pid INT NOT NULL
                        CONSTRAINT appt__ref_pers
                          REFERENCES Pers
                            ON UPDATE CASCADE
                            ON DELETE CASCADE
                            DEFERRABLE,
                      apptsubj TEXT NOT NULL,
                      apptdate DATE NOT NULL);

and insert some data into it:

INSERT INTO Appts (pid, apptsubj, apptdate) 
  VALUES (1, 'Discuss raise', '2001-01-15');
INSERT INTO Appts (pid, apptsubj, apptdate)
  VALUES (2, 'Plan project', '2001-01-20');

This works just fine. Also, just as before,

INSERT INTO Appts (pid, apptsubj, apptdate)
  VALUES (3, 'Termination', '2001-01-20');

does not work, as there is no Pers with pid=3.

However, unlike the Tasks table, Appts is set to CASCADE deletes and
updates. This means that if we update a person's pid, or delete a
person entirely, PostgreSQL will allow this by first deleting any
Tasks that exist for this person.

So:

UPDATE Pers SET pid=14 WHERE pid=2;

This will change Maria's pid to 14, first changing the connecting pid
in the Appts table.

DELETE FROM Pers WHERE pid=14;

This will delete Maria's appointment, then delete Maria.

Other Relationships:

Note that CASCADE in this class only refers to the CASCADE between the
Pers and Appt tables. Trying to DELETE Jeff would still fail because,
although the Pers-Appt relationship would CASCADE, the Pers-Tasks
relationship would fail, and PostgreSQL would report this DELETE
attempt as a violation of that referential integrity.

Possible actions:

NO ACTION: (the default). Stop the action if an update/delete would
fail referential integrity checking.

RESTRICT: currently means same thing as NO ACTION.

CASCADE: delete child data, then delete parent data (subject to other
relationships, as noted above)

SET NULL: set foreign key field to NULL, then update/delete parent.

SET DEFAULT: set foreign key field to DEFAULT value, then
update/delete parent.


SET NULL and SET DEFAULT can be useful options (especially for ON
DELETE). For example, we might have a table, Offices, that kept track
of which office a worker used. It might contain fields for officeid,
officelocation, and persid. If we wanted to delete a person, we
shouldn't be stopped just because that person has a related office,
but similarly, we don't want to delete an office just because this
person is being deleted. In some cases, the best option might be to
set the persid field for Offices to NULL (or DEFAULT), leaving the
office in place, but making it clear that this office is now unused.

Note that ON UPDATE and ON DELETE can have different rules. It's very
common, for instance to ON UPDATE CASCADE but ON DELETE
RESTRICT--allow people to change their person IDs, but not allow
deleting of Tasks if a related person exists.

It is possible to change the actions for a relationship, but requires
a little hacking in the system catalog tables. See Hacking the
Relationships, below.


Deferring

By default, referential integrity is checked for every single
relationship, for every single insert, delete, update that could
affect this relationship. This means that

INSERT INTO Tasks (pid, task) VALUES (5, 'Open sales office');
INSERT INTO Person (pid, name) VALUES (5, 'Helen Kim');

would fail, because at the time of the first attempted insert, there
is no person with pid=5. Much of the time, this is the most intuitive
setting.

Sometimes, however, you may not be able to predict the exact order
data arrives. Perhaps you are receiving data loaded from a text file,
or across the web. It's possible that the data may arrive in the order
above (task, then associated person). When this happens, you can
choose to defer the transaction checking.

In order to defer a transaction, two things must happen:

1) the referential integrity relationship must have been defined as
DEFERRABLE. This is not the default (NOT DEFERRABLE is), so it must be
declared explicity, as we did for Appts, above.

2) you must be in an explicit transaction.

3) you must either have DEFERRED be the initial default for this
relationship, or have SET the relationship constraint to DEFERRED for
this transaction.

For our Appts table relationship to Pers, we have declared this is as
deferred. So, if we do

BEGIN;
SET CONSTRAINTS ALL DEFERRABLE;
INSERT INTO Appts (pid, apptsubj, apptdate)
  VALUES (5, 'Negotiation', '2001-01-20');
INSERT INTO Pers VALUES (5, 'Helen Kim');
COMMIT;

works just fine. Note that having the relationship declared as
DEFERRABLE is not enough--we must also use SET to explicity set
CONSTRAINTS to deferred. In this example, we set all relationships to
DEFERRED; instead we could set only a single constraint to deferred,
as in

SET CONSTRAINTS appts__ref_pers DEFERRABLE;

It may be convenient to have a relationship already set, rather than
having to set this for every transaction. To do this, add "INITIALLY
DEFERRED" to the CONSTRAINT ... REFERENCES declaration for the table,
eg

# CREATE TABLE Appts (apptid SERIAL NOT NULL PRIMARY KEY,
                      pid INT NOT NULL
                        CONSTRAINT appt__ref_pers
                          REFERENCES Pers
                            ON UPDATE CASCADE
                            ON DELETE CASCADE
                            DEFERRABLE
			    INITALLY DEFERRABLE,
                      apptsubj TEXT NOT NULL,
                      apptdate DATE NOT NULL);

Now, we could simply

BEGIN;
INSERT INTO Appts (pid, apptsubj, apptdate)
  VALUES (5, 'Negotiation', '2001-01-20');
INSERT INTO Pers VALUES (5, 'Helen Kim');
COMMIT;

With no explicit SET command. This is especially convenient if you
work in a programming setting that abstracts SQL commands and makes it
difficult to execute an arbitrary, nonstandard SQL standard such as
"SET...".

[Ed: I assume that deferred could be a bit faster for large inserts,
too, since the subroutine would be exited out of quickly until the
very last check. Is there enough time difference/can anyone verify
this?]


Hacking Referential Integrity

Referential integrity works great in PostgreSQL. However, PostgreSQL
does not (yet) have SQL-synax commands to change actions, turn on/off
RI, etc. However, as this information is stored in the system catalog
tables, it can be performed by editing these tables directly.

WARNING: to edit the system catalog tables, you must be a superuser in
PostgreSQL. In addition, you should be VERY CAREFULLY when editing
these tables, and make sure that you have a backup first (via
pg_dump). An accidental table-wide UPDATE or DELETE could delete all
of your tables, ruins your indexes, corrupt your database,
etc. Practice this by working in a test database, preferrably even on
a machine without any other critical databases.

There are several system catalog tables of interest to us:

pg_class  all "classses", included tables, views, sequences, etc.
pg_trigger    all triggers. PostgreSQL handles referential integrity
using behind-the-scenes triggers, so this is where all of your RI
controls are stored.
pg_proc	 all PostgreSQL procedures. We won't need to make any
changes to this, but will use this to show what the RI procedures used
actually are.

To see (most) of the system tables, you can use the command \dS in
psql.

For example, for our test tables, let's collect information from
pg_class about our tables.

# SELECT oid, relname FROM pg_class WHERE relname IN
('pers','appts','tasks');

Note that PostgreSQL usually downcases non-quoted system identifiers
like tables for us automatically (that is, I can CREATE TABLE foo but
SELECT * FROM FOO.) However, when examining pg_class, you must work in
a case-sensitive manner, or use case-insensitive operators.

This query returns

  oid | relname
--------+---------
 9100 | pers
 9110 | tasks
 9120 | appts
(3 rows)

(Your OIDs will be different. Don't worry, just notice what they are.)

Now, if we look in pg_trigger, we can find the triggers that are used
by our tables.

test2=# SELECT * from pg_trigger WHERE tgrelid in
(9100, 9110, 9120);

 tgrelid |           tgname            | tgfoid | tgtype | tgenabled |
tgisconstraint |  tgconstrname  | tgconstrrelid | tgdeferrable |
tginitdeferred | tgnargs | tgattr |
tgargs
---------+-----------------------------+--------+--------+-----------+----------------+----------------+---------------+--------------+----------------+---------+--------+------------------------------------------------------------------
  263721 | RI_ConstraintTrigger_263752 |   1644 |     21 | t         |
t              | tasks__ref_p   |        263674 | f
      | f              |       6 |        |
tasks__ref_p\000tasks\000pers\000UNSPECIFIED\000pid\000pid\000
  263674 | RI_ConstraintTrigger_263754 |   1654 |      9 | t         |
t              | tasks__ref_p   |        263721 | f
      | f              |       6 |        |
tasks__ref_p\000tasks\000pers\000UNSPECIFIED\000pid\000pid\000
  263674 | RI_ConstraintTrigger_263756 |   1655 |     17 | t         |
t              | tasks__ref_p   |        263721 | f
      | f              |       6 |        |
tasks__ref_p\000tasks\000pers\000UNSPECIFIED\000pid\000pid\000
  263776 | RI_ConstraintTrigger_263808 |   1644 |     21 | t         |
t              | appt__ref_pers |        263674 | t
      | f              |       6 |        |
appt__ref_pers\000appts\000pers\000UNSPECIFIED\000pid\000pid\000
  263674 | RI_ConstraintTrigger_263810 |   1646 |      9 | t         |
t              | appt__ref_pers |        263776 | t
      | f              |       6 |        |
appt__ref_pers\000appts\000pers\000UNSPECIFIED\000pid\000pid\000
  263674 | RI_ConstraintTrigger_263812 |   1647 |     17 | t         |
t              | appt__ref_pers |        263776 | t
      | f              |       6 |        |
appt__ref_pers\000appts\000pers\000UNSPECIFIED\000pid\000pid\000
(6 rows)

The columns in pg_trigger are:

tgrelid	    What table is this trigger on. That is, which table is
inserted/updated/deleted that calls that this trigger.
tgname			 Name of this trigger. Trigger names are
generated by the referential integrity feature of PostgreSQL are are
unimportant, except that they are unique.
tgfoid	     OID of the function that is called. This is an important
column--by changing the function called, we can change which action
(CASCADE, etc.) is performed. (See below).
tgtype	  What kind of trigger is this (is it UPDATE, DELETE, etc.)
tgenabled Is this trigger enabled?
tgisconstraint	  Is this trigger part of a constraint. Non-RI
triggers that are user-declared by CREATE TRIGGER may be FALSE for
this, but RI triggers will always be true.
tgconstrname Name of constraint that calls this trigger. If you named
your constraints (as we did, eg pers__ref_tasks), this will be the
constraint name, otherwise it will be unnamed.
tgconstrrelid	 OID of table that which had the constraint set.
tgdeferrable	 Can this constraint be deferred? (Equivalent to
DEFERRABLE. See above.)
tginitdeferred	Is this constraint initially deferred? (Equivalent to
INITIALLY DEFERRED. See above.)
tgnargs	  Number of arguments for the RI function. As of PostgreSQL
7.1, this always seems to be 6, and should not be edited.
tgattr	  ???
tgargs	  The actual arguments to the RI function. 

To help make this more understandable and usable, a view is helpful:

CREATE VIEW dev_ri
AS 
SELECT   t.oid as trigoid,
          c.relname as trig_tbl,
          t.tgfoid, 
          f.proname as trigfunc,
          t.tgenabled,
          t.tgconstrname,
          c2.relname as const_tbl,
          t.tgdeferrable,
          t.tginitdeferred
FROM      pg_trigger t,
          pg_class c, 
          pg_class c2, 
          pg_proc f 
WHERE     t.tgrelid=c.oid 
 AND      t.tgconstrrelid=c2.oid 
 AND      tgfoid=f.oid
 AND      tgname ~ '^RI_'
ORDER BY  t.oid;

(This view requires PostgreSQL 7.1 because of the ORDER BY. For
versions earlier than 7.1, you can remove the ORDER BY.)

# select * from dev_ri;

 trigoid | trig_tbl | tgfoid |       trigfunc       | tgenabled |
tgconstrname  | const_tbl | tgdeferrable | tginitdeferred
---------+----------+--------+----------------------+-----------+----------------+-----------+--------------+----------------
  263753 | tasks    |   1644 | RI_FKey_check_ins    | t         |
tasks__ref_p   | pers      | f            | f
  263755 | pers     |   1654 | RI_FKey_noaction_del | t         |
tasks__ref_p   | tasks     | f            | f
  263757 | pers     |   1655 | RI_FKey_noaction_upd | t         |
tasks__ref_p   | tasks     | f            | f
  263809 | appts    |   1644 | RI_FKey_check_ins    | t         |
appt__ref_pers | pers      | t            | f
  263811 | pers     |   1646 | RI_FKey_cascade_del  | t         |
appt__ref_pers | appts     | t            | f
  263813 | pers     |   1647 | RI_FKey_cascade_upd  | t         |
appt__ref_pers | appts     | t            | f
(6 rows)

Now, it's much easier to understand what's happening. For example,
from trigger oid=263753, we can see that on inserts to the Tasks
table, RI_FKey_check_ins is called, which checks the Pers table.

From here, we can:

1) Temporarily disable a trigger. If you want to load lots of data,
and not slow down with referential integity checks (and without having
to deal w/transaction-level deferrment):

UPDATE pg_trigger SET tgenabled=FALSE WHERE oid=xxx;

(get the trigoid column from our view to find which trigger you want
to affect.)

[pg_dump files do this so that table data can be inserted in any
order, without having to worry about RI rules.]

2) Make a trigger DEFERRABLE, if it orginally wasn't, or vice-versa:

UPDATE pg_trigger SET tgdeferrable=[ TRUE | FALSE ] WHERE oid=xxx;

Or, to make a trigger INITIALLY DEFERRED (or turn this off):

UPDATE pg_trigger SET tginitdeferred=[ TRUE | FALSE ] WHERE oid=xxx;

3) Change the action for a trigger.

If you've created a trigger with an action (or with the default NO
ACTION action), you can change your mind by changing the function
called. To get the list of all RI trigger functions:

SELECT oid, proname FROM pg_proc where proname ~ '^RI_';
 oid  |        proname
------+------------------------
 1646 | RI_FKey_cascade_del
 1647 | RI_FKey_cascade_upd
 1644 | RI_FKey_check_ins
 1645 | RI_FKey_check_upd
 1654 | RI_FKey_noaction_del
 1655 | RI_FKey_noaction_upd
 1648 | RI_FKey_restrict_del
 1649 | RI_FKey_restrict_upd
 1652 | RI_FKey_setdefault_del
 1653 | RI_FKey_setdefault_upd
 1650 | RI_FKey_setnull_del
 1651 | RI_FKey_setnull_upd

(Your OIDs will probably be different. Note and use your own.)

In our example, to set updates on Pers(pid) to CASCADE, rather than NO
ACTION on Pers-Tasks,

UPDATE pg_trigger SET tgfoid=1647 WHERE oid=xxx;

(where xxx is our current noaction_upd trigger for Pers-Tasks)


Notice that these changes often require a new backend. Quit and
restart psql, or reset your client connection, and you should be able
to test out your new settings.