Skip site navigation (1) Skip section navigation (2)

Proposed tutorial or chapter on RI/hacking RI in catalog

From: "Joel Burton" <jburton(at)scw(dot)org>
To: pgsql-docs(at)postgresql(dot)org
Subject: Proposed tutorial or chapter on RI/hacking RI in catalog
Date: 2000-12-03 22:29:20
Message-ID: 3A2A82F0.24487.D98117@localhost (view raw or flat)
Thread:
Lists: pgsql-docs
[This is a report, since the original post never appeared. My 
apologies if it's a duplicate for you.]


I've written the start of a tutorial on how to use the RI features of
Pgsql, and how to alter the system catalogs to change actions,
deferrability, etc. Much of the information is already in the docs, but
spread around different SQL reference sections, etc., so I thought it
would be useful in one place.

It's not in Docbook; I want to see first what structural changes 
people
can recommend, etc., before it gets marked up.

It's still quite bit rough, so I'd hold off on grammar/spelling/etc., but 
I would be interested if people have (a) tips not in it, (b) factual 
errors, or (c) ideas about where it might fit into the docs.

I do read this group, but I always like a cc: to make sure I don't miss
anything: jburton(at)scw(dot)org(dot)

Thanks!




Referential Integrity Tutorial & Hacking Referential 
Integrity:


Referential integrity is the feature of a database that 
ensures that
implied relationships in a database are enforced. It is 
a feature of
most database systems, and protects users from 
accidentally (or
intentially!) creating discrepencies in their database.

In this chapter, we will work with a sample set of data 
involving
people, tasks they are responsible for, and appointments 
you have
scheduled with them. As you fire employees, you delete 
them from your
person table, but want to be certain that you can't 
leave tasks that
no one is in charge of, or leave appointments where the 
persons name
cannot be located.

(If your not familiar with the concepts of Primary Keys 
and Foreign
Keys, it might be helpful to review a basic database 
primer, as this
tutorial assumes basic knowledge of these concepts.)


First, because we're going playing with the system 
tables, it's best to do
this in a practice database:

# CREATE DATABASE RI_test;
# \c RI_test

Now, let's create practice tables. We'll have one parent 
table, Pers, and
two child tables, Tasks and Appts.

The parent table is straightforward:

# CREATE TABLE Pers (pid INT NOT NULL PRIMARY KEY, 
                     pname TEXT NOT NULL);

The first child table, which will hold tasks for which 
this person is
responsible. 

# CREATE TABLE Tasks (taskid SERIAL NOT NULL PRIMARY KEY,
                      pid INT NOT NULL
                        CONSTRAINT tasks__ref_p
                          REFERENCES Pers,
		      task TEXT NOT NULL);

A few notes:

* foreign key references are handled by PostgreSQL as a 
type of
CONSTRAINT. Therefore, we can begin the referential 
declaration with
CONSTRAINT <name of constraint>. If this is done, the 
error messages
generated by PostgreSQL will return this error name 
(which the user
could helpfully see, or which we could parse and handle 
in a front-end
system.) By wrapping it in system identifiers ("), we 
could even make
it look like an error message (eg CONSTRAINT "Table 
tasks refers to table
Pers"), but we are limited in the length of this 
name/message, and
therefore it's usefulness. CONSTRAINT names are not 
neccessary, and we
could leave this off: ...pid INT NOT NULL REFERENCES 
pers..., in which
case everything works the same, except our constraint is 
unnamed, and
therefore the error messages are more generic.

* after REFERENCES comes the name of the table we are 
referring to,
and (optionally), the name of the field in parentheses. 
If the field
is the primary key, the fieldname is optional. If we had 
not made pid
table Pers's primary key, we would have to say 
REFERENCES Pers(pid).

* optionally, we can inform PostgreSQL how to handle 
deletes/updates
on table Pers by adding ON DELETE and ON UPDATE 
declarations. If these
are not made specific, then PostgreSQL defaults to "no 
action", which
(at the time of this writing, 12/2000) means the same 
thing as
"restrict"

[Ed: my digging into the source code shows that they're 
treated the
same, though there is a comment about SQL3 requiring 
that it be
trapped before... I'm not sure what this is referring 
to, exactly, but
it still seems true that the mean the same. Is this 
accurate?]

, which means that we could neither delete a parent nor
update the parent pid field if a child existed that 
depended on that
parent. In our other child table, we'll see another way 
to handle
this.

* optionally, we could tell PostgreSQL if this is 
"deferrable", that
is, if we are allowed to violate referential integrity 
while inside of
some transaction as long as we have fixed our integrity 
problem at the
end of the transaction. We'll see examples of this 
below, in
DEFERRING.

So, let's add some sample data:

INSERT INTO Pers VALUES (1, 'Jeff Brown');
INSERT INTO Pers VALUES (2, 'Maria Lane');

INSERT INTO Tasks (pid, task) VALUES (1, 'Write 
contract');
INSERT INTO Tasks (pid, task) VALUES (1, 'Upgrade 
database');

So far so good.

If we try to insert a child that has no parent, eg

INSERT INTO Tasks (pid, task) VALUES (3, 'Install 
Linux');

we'll get a referential integrity error. This error will 
refer to our
constraint name (if any), and will block the entering of 
this data
into Tasks. (In addition, if this is part of a 
transaction, it will
rollback the entire transaction, as always happens in 
PostgreSQL).

Also, if we try to change a parent that has children:

DELETE FROM Pers WHERE pname = 'Jeff Brown';

We'll get the same problem, as we cannot delete Jeff as 
long
as he has his two tasks assigned.

DELETE FROM Pers WHERE pname = 'Maria Lane';

(if you actually deleted Maria, add her back in for our 
later
examples.)

because there are currently no child tasks for Maria.

Default (NO ACTION) (or RESTRICT, which means the same 
thing to
PostgreSQL) ON UPDATE and ON DELETE rules make sense for 
many
situations like this. You wouldn't want to be able to 
delete a staff
person from your database if they had certain 
resposibilities,
otherwise you would never know who was in charge of 
different tasks.


A More Permissive Child Table: Appointments

Let's create a second child, Appts (for Appointments):

# CREATE TABLE Appts (apptid SERIAL NOT NULL PRIMARY KEY,
                      pid INT NOT NULL
                        CONSTRAINT appt__ref_pers
                          REFERENCES Pers
                            ON UPDATE CASCADE
                            ON DELETE CASCADE
                            DEFERRABLE,
                      apptsubj TEXT NOT NULL,
                      apptdate DATE NOT NULL);

and insert some data into it:

INSERT INTO Appts (pid, apptsubj, apptdate) 
  VALUES (1, 'Discuss raise', '2001-01-15');
INSERT INTO Appts (pid, apptsubj, apptdate)
  VALUES (2, 'Plan project', '2001-01-20');

This works just fine. Also, just as before,

INSERT INTO Appts (pid, apptsubj, apptdate)
  VALUES (3, 'Termination', '2001-01-20');

does not work, as there is no Pers with pid=3.

However, unlike the Tasks table, Appts is set to CASCADE 
deletes and
updates. This means that if we update a person's pid, or 
delete a
person entirely, PostgreSQL will allow this by first 
deleting any
Tasks that exist for this person.

So:

UPDATE Pers SET pid=14 WHERE pid=2;

This will change Maria's pid to 14, first changing the 
connecting pid
in the Appts table.

DELETE FROM Pers WHERE pid=14;

This will delete Maria's appointment, then delete Maria.

Other Relationships:

Note that CASCADE in this class only refers to the 
CASCADE between the
Pers and Appt tables. Trying to DELETE Jeff would still 
fail because,
although the Pers-Appt relationship would CASCADE, the 
Pers-Tasks
relationship would fail, and PostgreSQL would report 
this DELETE
attempt as a violation of that referential integrity.

Possible actions:

NO ACTION: (the default). Stop the action if an 
update/delete would
fail referential integrity checking.

RESTRICT: currently means same thing as NO ACTION.

CASCADE: delete child data, then delete parent data 
(subject to other
relationships, as noted above)

SET NULL: set foreign key field to NULL, then 
update/delete parent.

SET DEFAULT: set foreign key field to DEFAULT value, then
update/delete parent.


SET NULL and SET DEFAULT can be useful options 
(especially for ON
DELETE). For example, we might have a table, Offices, 
that kept track
of which office a worker used. It might contain fields 
for officeid,
officelocation, and persid. If we wanted to delete a 
person, we
shouldn't be stopped just because that person has a 
related office,
but similarly, we don't want to delete an office just 
because this
person is being deleted. In some cases, the best option 
might be to
set the persid field for Offices to NULL (or DEFAULT), 
leaving the
office in place, but making it clear that this office is 
now unused.

Note that ON UPDATE and ON DELETE can have different 
rules. It's very
common, for instance to ON UPDATE CASCADE but ON DELETE
RESTRICT--allow people to change their person IDs, but 
not allow
deleting of Tasks if a related person exists.

It is possible to change the actions for a relationship, 
but requires
a little hacking in the system catalog tables. See 
Hacking the
Relationships, below.


Deferring

By default, referential integrity is checked for every 
single
relationship, for every single insert, delete, update 
that could
affect this relationship. This means that

INSERT INTO Tasks (pid, task) VALUES (5, 'Open sales 
office');
INSERT INTO Person (pid, name) VALUES (5, 'Helen Kim');

would fail, because at the time of the first attempted 
insert, there
is no person with pid=5. Much of the time, this is the 
most intuitive
setting.

Sometimes, however, you may not be able to predict the 
exact order
data arrives. Perhaps you are receiving data loaded from 
a text file,
or across the web. It's possible that the data may 
arrive in the order
above (task, then associated person). When this happens, 
you can
choose to defer the transaction checking.

In order to defer a transaction, two things must happen:

1) the referential integrity relationship must have been 
defined as
DEFERRABLE. This is not the default (NOT DEFERRABLE is), 
so it must be
declared explicity, as we did for Appts, above.

2) you must be in an explicit transaction.

3) you must either have DEFERRED be the initial default 
for this
relationship, or have SET the relationship constraint to 
DEFERRED for
this transaction.

For our Appts table relationship to Pers, we have 
declared this is as
deferred. So, if we do

BEGIN;
SET CONSTRAINTS ALL DEFERRABLE;
INSERT INTO Appts (pid, apptsubj, apptdate)
  VALUES (5, 'Negotiation', '2001-01-20');
INSERT INTO Pers VALUES (5, 'Helen Kim');
COMMIT;

works just fine. Note that having the relationship 
declared as
DEFERRABLE is not enough--we must also use SET to 
explicity set
CONSTRAINTS to deferred. In this example, we set all 
relationships to
DEFERRED; instead we could set only a single constraint 
to deferred,
as in

SET CONSTRAINTS appts__ref_pers DEFERRABLE;

It may be convenient to have a relationship already set, 
rather than
having to set this for every transaction. To do this, 
add "INITIALLY
DEFERRED" to the CONSTRAINT ... REFERENCES declaration 
for the table,
eg

# CREATE TABLE Appts (apptid SERIAL NOT NULL PRIMARY KEY,
                      pid INT NOT NULL
                        CONSTRAINT appt__ref_pers
                          REFERENCES Pers
                            ON UPDATE CASCADE
                            ON DELETE CASCADE
                            DEFERRABLE
			    INITALLY DEFERRABLE,
                      apptsubj TEXT NOT NULL,
                      apptdate DATE NOT NULL);

Now, we could simply

BEGIN;
INSERT INTO Appts (pid, apptsubj, apptdate)
  VALUES (5, 'Negotiation', '2001-01-20');
INSERT INTO Pers VALUES (5, 'Helen Kim');
COMMIT;

With no explicit SET command. This is especially 
convenient if you
work in a programming setting that abstracts SQL 
commands and makes it
difficult to execute an arbitrary, nonstandard SQL 
standard such as
"SET...".

[Ed: I assume that deferred could be a bit faster for 
large inserts,
too, since the subroutine would be exited out of quickly 
until the
very last check. Is there enough time difference/can 
anyone verify
this?]


Hacking Referential Integrity

Referential integrity works great in PostgreSQL. 
However, PostgreSQL
does not (yet) have SQL-synax commands to change 
actions, turn on/off
RI, etc. However, as this information is stored in the 
system catalog
tables, it can be performed by editing these tables 
directly.

WARNING: to edit the system catalog tables, you must be 
a superuser in
PostgreSQL. In addition, you should be VERY CAREFULLY 
when editing
these tables, and make sure that you have a backup first 
(via
pg_dump). An accidental table-wide UPDATE or DELETE 
could delete all
of your tables, ruins your indexes, corrupt your 
database,
etc. Practice this by working in a test database, 
preferrably even on
a machine without any other critical databases.

There are several system catalog tables of interest to 
us:

pg_class  all "classses", included tables, views, 
sequences, etc.
pg_trigger    all triggers. PostgreSQL handles 
referential integrity
using behind-the-scenes triggers, so this is where all 
of your RI
controls are stored.
pg_proc	 all PostgreSQL procedures. We won't need to 
make any
changes to this, but will use this to show what the RI 
procedures used
actually are.

To see (most) of the system tables, you can use the 
command \dS in
psql.

For example, for our test tables, let's collect 
information from
pg_class about our tables.

# SELECT oid, relname FROM pg_class WHERE relname IN
('pers','appts','tasks');

Note that PostgreSQL usually downcases non-quoted system 
identifiers
like tables for us automatically (that is, I can CREATE 
TABLE foo but
SELECT * FROM FOO.) However, when examining pg_class, 
you must work in
a case-sensitive manner, or use case-insensitive 
operators.

This query returns

  oid | relname
--------+---------
 9100 | pers
 9110 | tasks
 9120 | appts
(3 rows)

(Your OIDs will be different. Don't worry, just notice 
what they are.)

Now, if we look in pg_trigger, we can find the triggers 
that are used
by our tables.

test2=# SELECT * from pg_trigger WHERE tgrelid in
(9100, 9110, 9120);

 tgrelid |           tgname            | tgfoid | tgtype 
| tgenabled |
tgisconstraint |  tgconstrname  | tgconstrrelid | 
tgdeferrable |
tginitdeferred | tgnargs | tgattr |
tgargs
---------+-----------------------------+--------+--------
+-----------+----------------+----------------+----------
-----+--------------+----------------+---------+--------
+--------------------------------------------------------
----------
  263721 | RI_ConstraintTrigger_263752 |   1644 |     21 
| t         |
t              | tasks__ref_p   |        263674 | f
      | f              |       6 |        |
tasks__ref_p\000tasks\000pers\000UNSPECIFIED\000pid\000pi
d\000
  263674 | RI_ConstraintTrigger_263754 |   1654 |      9 
| t         |
t              | tasks__ref_p   |        263721 | f
      | f              |       6 |        |
tasks__ref_p\000tasks\000pers\000UNSPECIFIED\000pid\000pi
d\000
  263674 | RI_ConstraintTrigger_263756 |   1655 |     17 
| t         |
t              | tasks__ref_p   |        263721 | f
      | f              |       6 |        |
tasks__ref_p\000tasks\000pers\000UNSPECIFIED\000pid\000pi
d\000
  263776 | RI_ConstraintTrigger_263808 |   1644 |     21 
| t         |
t              | appt__ref_pers |        263674 | t
      | f              |       6 |        |
appt__ref_pers\000appts\000pers\000UNSPECIFIED\000pid\000
pid\000
  263674 | RI_ConstraintTrigger_263810 |   1646 |      9 
| t         |
t              | appt__ref_pers |        263776 | t
      | f              |       6 |        |
appt__ref_pers\000appts\000pers\000UNSPECIFIED\000pid\000
pid\000
  263674 | RI_ConstraintTrigger_263812 |   1647 |     17 
| t         |
t              | appt__ref_pers |        263776 | t
      | f              |       6 |        |
appt__ref_pers\000appts\000pers\000UNSPECIFIED\000pid\000
pid\000
(6 rows)

The columns in pg_trigger are:

tgrelid	    What table is this trigger on. That is, 
which table is
inserted/updated/deleted that calls that this trigger.
tgname			 Name of this trigger. Trigger names are
generated by the referential integrity feature of 
PostgreSQL are are
unimportant, except that they are unique.
tgfoid	     OID of the function that is called. This is 
an important
column--by changing the function called, we can change 
which action
(CASCADE, etc.) is performed. (See below).
tgtype	  What kind of trigger is this (is it UPDATE, 
DELETE, etc.)
tgenabled Is this trigger enabled?
tgisconstraint	  Is this trigger part of a constraint. 
Non-RI
triggers that are user-declared by CREATE TRIGGER may be 
FALSE for
this, but RI triggers will always be true.
tgconstrname Name of constraint that calls this trigger. 
If you named
your constraints (as we did, eg pers__ref_tasks), this 
will be the
constraint name, otherwise it will be unnamed.
tgconstrrelid	 OID of table that which had the constraint 
set.
tgdeferrable	 Can this constraint be deferred? 
(Equivalent to
DEFERRABLE. See above.)
tginitdeferred	Is this constraint initially deferred? 
(Equivalent to
INITIALLY DEFERRED. See above.)
tgnargs	  Number of arguments for the RI function. As 
of PostgreSQL
7.1, this always seems to be 6, and should not be edited.
tgattr	  ???
tgargs	  The actual arguments to the RI function. 

To help make this more understandable and usable, a view 
is helpful:

CREATE VIEW dev_ri
AS 
SELECT   t.oid as trigoid,
          c.relname as trig_tbl,
          t.tgfoid, 
          f.proname as trigfunc,
          t.tgenabled,
          t.tgconstrname,
          c2.relname as const_tbl,
          t.tgdeferrable,
          t.tginitdeferred
FROM      pg_trigger t,
          pg_class c, 
          pg_class c2, 
          pg_proc f 
WHERE     t.tgrelid=c.oid 
 AND      t.tgconstrrelid=c2.oid 
 AND      tgfoid=f.oid
 AND      tgname ~ '^RI_'
ORDER BY  t.oid;

(This view requires PostgreSQL 7.1 because of the ORDER 
BY. For
versions earlier than 7.1, you can remove the ORDER BY.)

# select * from dev_ri;

 trigoid | trig_tbl | tgfoid |       trigfunc       | 
tgenabled |
tgconstrname  | const_tbl | tgdeferrable | tginitdeferred
---------+----------+--------+----------------------+----
-------+----------------+-----------+--------------+-----
-----------
  263753 | tasks    |   1644 | RI_FKey_check_ins    | t  
       |
tasks__ref_p   | pers      | f            | f
  263755 | pers     |   1654 | RI_FKey_noaction_del | t  
       |
tasks__ref_p   | tasks     | f            | f
  263757 | pers     |   1655 | RI_FKey_noaction_upd | t  
       |
tasks__ref_p   | tasks     | f            | f
  263809 | appts    |   1644 | RI_FKey_check_ins    | t  
       |
appt__ref_pers | pers      | t            | f
  263811 | pers     |   1646 | RI_FKey_cascade_del  | t  
       |
appt__ref_pers | appts     | t            | f
  263813 | pers     |   1647 | RI_FKey_cascade_upd  | t  
       |
appt__ref_pers | appts     | t            | f
(6 rows)

Now, it's much easier to understand what's happening. 
For example,
from trigger oid=263753, we can see that on inserts to 
the Tasks
table, RI_FKey_check_ins is called, which checks the 
Pers table.

From here, we can:

1) Temporarily disable a trigger. If you want to load 
lots of data,
and not slow down with referential integity checks (and 
without having
to deal w/transaction-level deferrment):

UPDATE pg_trigger SET tgenabled=FALSE WHERE oid=xxx;

(get the trigoid column from our view to find which 
trigger you want
to affect.)

[pg_dump files do this so that table data can be 
inserted in any
order, without having to worry about RI rules.]

2) Make a trigger DEFERRABLE, if it orginally wasn't, or 
vice-versa:

UPDATE pg_trigger SET tgdeferrable=[ TRUE | FALSE ] 
WHERE oid=xxx;

Or, to make a trigger INITIALLY DEFERRED (or turn this 
off):

UPDATE pg_trigger SET tginitdeferred=[ TRUE | FALSE ] 
WHERE oid=xxx;

3) Change the action for a trigger.

If you've created a trigger with an action (or with the 
default NO
ACTION action), you can change your mind by changing the 
function
called. To get the list of all RI trigger functions:

SELECT oid, proname FROM pg_proc where proname ~ '^RI_';
 oid  |        proname
------+------------------------
 1646 | RI_FKey_cascade_del
 1647 | RI_FKey_cascade_upd
 1644 | RI_FKey_check_ins
 1645 | RI_FKey_check_upd
 1654 | RI_FKey_noaction_del
 1655 | RI_FKey_noaction_upd
 1648 | RI_FKey_restrict_del
 1649 | RI_FKey_restrict_upd
 1652 | RI_FKey_setdefault_del
 1653 | RI_FKey_setdefault_upd
 1650 | RI_FKey_setnull_del
 1651 | RI_FKey_setnull_upd

(Your OIDs will probably be different. Note and use your 
own.)

In our example, to set updates on Pers(pid) to CASCADE, 
rather than NO
ACTION on Pers-Tasks,

UPDATE pg_trigger SET tgfoid=1647 WHERE oid=xxx;

(where xxx is our current noaction_upd trigger for Pers-
Tasks)


Notice that these changes often require a new backend. 
Quit and
restart psql, or reset your client connection, and you 
should be able
to test out your new settings.


--
Joel Burton, Director of Information Systems -*- jburton(at)scw(dot)org
Support Center of Washington (www.scw.org)

pgsql-docs by date

Next:From: Joel BurtonDate: 2000-12-04 16:04:37
Subject: Patches for docs
Previous:From: Joel BurtonDate: 2000-12-02 23:04:33
Subject: Referential Integrity / Hacking RI Tutorial

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group