Proposed tutorial or chapter on RI/hacking RI in catalog

From: "Joel Burton" <jburton(at)scw(dot)org>
To: pgsql-docs(at)postgresql(dot)org
Subject: Proposed tutorial or chapter on RI/hacking RI in catalog
Date: 2000-12-03 22:29:20
Message-ID: 3A2A82F0.24487.D98117@localhost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-docs

[This is a report, since the original post never appeared. My
apologies if it's a duplicate for you.]

I've written the start of a tutorial on how to use the RI features of
Pgsql, and how to alter the system catalogs to change actions,
deferrability, etc. Much of the information is already in the docs, but
spread around different SQL reference sections, etc., so I thought it
would be useful in one place.

It's not in Docbook; I want to see first what structural changes
people
can recommend, etc., before it gets marked up.

It's still quite bit rough, so I'd hold off on grammar/spelling/etc., but
I would be interested if people have (a) tips not in it, (b) factual
errors, or (c) ideas about where it might fit into the docs.

I do read this group, but I always like a cc: to make sure I don't miss
anything: jburton(at)scw(dot)org(dot)

Thanks!

Referential Integrity Tutorial & Hacking Referential
Integrity:

Referential integrity is the feature of a database that
ensures that
implied relationships in a database are enforced. It is
a feature of
most database systems, and protects users from
accidentally (or
intentially!) creating discrepencies in their database.

In this chapter, we will work with a sample set of data
involving
people, tasks they are responsible for, and appointments
you have
scheduled with them. As you fire employees, you delete
them from your
person table, but want to be certain that you can't
leave tasks that
no one is in charge of, or leave appointments where the
persons name
cannot be located.

(If your not familiar with the concepts of Primary Keys
and Foreign
Keys, it might be helpful to review a basic database
primer, as this
tutorial assumes basic knowledge of these concepts.)

First, because we're going playing with the system
tables, it's best to do
this in a practice database:

# CREATE DATABASE RI_test;
# \c RI_test

Now, let's create practice tables. We'll have one parent
table, Pers, and
two child tables, Tasks and Appts.

The parent table is straightforward:

# CREATE TABLE Pers (pid INT NOT NULL PRIMARY KEY,
pname TEXT NOT NULL);

The first child table, which will hold tasks for which
this person is
responsible.

# CREATE TABLE Tasks (taskid SERIAL NOT NULL PRIMARY KEY,
pid INT NOT NULL
CONSTRAINT tasks__ref_p
REFERENCES Pers,
task TEXT NOT NULL);

A few notes:

* foreign key references are handled by PostgreSQL as a
type of
CONSTRAINT. Therefore, we can begin the referential
declaration with
CONSTRAINT <name of constraint>. If this is done, the
error messages
generated by PostgreSQL will return this error name
(which the user
could helpfully see, or which we could parse and handle
in a front-end
system.) By wrapping it in system identifiers ("), we
could even make
it look like an error message (eg CONSTRAINT "Table
tasks refers to table
Pers"), but we are limited in the length of this
name/message, and
therefore it's usefulness. CONSTRAINT names are not
neccessary, and we
could leave this off: ...pid INT NOT NULL REFERENCES
pers..., in which
case everything works the same, except our constraint is
unnamed, and
therefore the error messages are more generic.

* after REFERENCES comes the name of the table we are
referring to,
and (optionally), the name of the field in parentheses.
If the field
is the primary key, the fieldname is optional. If we had
not made pid
table Pers's primary key, we would have to say
REFERENCES Pers(pid).

* optionally, we can inform PostgreSQL how to handle
deletes/updates
on table Pers by adding ON DELETE and ON UPDATE
declarations. If these
are not made specific, then PostgreSQL defaults to "no
action", which
(at the time of this writing, 12/2000) means the same
thing as
"restrict"

[Ed: my digging into the source code shows that they're
treated the
same, though there is a comment about SQL3 requiring
that it be
trapped before... I'm not sure what this is referring
to, exactly, but
it still seems true that the mean the same. Is this
accurate?]

, which means that we could neither delete a parent nor
update the parent pid field if a child existed that
depended on that
parent. In our other child table, we'll see another way
to handle
this.

* optionally, we could tell PostgreSQL if this is
"deferrable", that
is, if we are allowed to violate referential integrity
while inside of
some transaction as long as we have fixed our integrity
problem at the
end of the transaction. We'll see examples of this
below, in
DEFERRING.

So, let's add some sample data:

INSERT INTO Pers VALUES (1, 'Jeff Brown');
INSERT INTO Pers VALUES (2, 'Maria Lane');

INSERT INTO Tasks (pid, task) VALUES (1, 'Write
contract');
INSERT INTO Tasks (pid, task) VALUES (1, 'Upgrade
database');

So far so good.

If we try to insert a child that has no parent, eg

INSERT INTO Tasks (pid, task) VALUES (3, 'Install
Linux');

we'll get a referential integrity error. This error will
refer to our
constraint name (if any), and will block the entering of
this data
into Tasks. (In addition, if this is part of a
transaction, it will
rollback the entire transaction, as always happens in
PostgreSQL).

Also, if we try to change a parent that has children:

DELETE FROM Pers WHERE pname = 'Jeff Brown';

We'll get the same problem, as we cannot delete Jeff as
long
as he has his two tasks assigned.

DELETE FROM Pers WHERE pname = 'Maria Lane';

(if you actually deleted Maria, add her back in for our
later
examples.)

because there are currently no child tasks for Maria.

Default (NO ACTION) (or RESTRICT, which means the same
thing to
PostgreSQL) ON UPDATE and ON DELETE rules make sense for
many
situations like this. You wouldn't want to be able to
delete a staff
person from your database if they had certain
resposibilities,
otherwise you would never know who was in charge of
different tasks.

A More Permissive Child Table: Appointments

Let's create a second child, Appts (for Appointments):

# CREATE TABLE Appts (apptid SERIAL NOT NULL PRIMARY KEY,
pid INT NOT NULL
CONSTRAINT appt__ref_pers
REFERENCES Pers
ON UPDATE CASCADE
ON DELETE CASCADE
DEFERRABLE,
apptsubj TEXT NOT NULL,
apptdate DATE NOT NULL);

and insert some data into it:

INSERT INTO Appts (pid, apptsubj, apptdate)
VALUES (1, 'Discuss raise', '2001-01-15');
INSERT INTO Appts (pid, apptsubj, apptdate)
VALUES (2, 'Plan project', '2001-01-20');

This works just fine. Also, just as before,

INSERT INTO Appts (pid, apptsubj, apptdate)
VALUES (3, 'Termination', '2001-01-20');

does not work, as there is no Pers with pid=3.

However, unlike the Tasks table, Appts is set to CASCADE
deletes and
updates. This means that if we update a person's pid, or
delete a
person entirely, PostgreSQL will allow this by first
deleting any
Tasks that exist for this person.

So:

UPDATE Pers SET pid=14 WHERE pid=2;

This will change Maria's pid to 14, first changing the
connecting pid
in the Appts table.

DELETE FROM Pers WHERE pid=14;

This will delete Maria's appointment, then delete Maria.

Other Relationships:

Note that CASCADE in this class only refers to the
CASCADE between the
Pers and Appt tables. Trying to DELETE Jeff would still
fail because,
although the Pers-Appt relationship would CASCADE, the
Pers-Tasks
relationship would fail, and PostgreSQL would report
this DELETE
attempt as a violation of that referential integrity.

Possible actions:

NO ACTION: (the default). Stop the action if an
update/delete would
fail referential integrity checking.

RESTRICT: currently means same thing as NO ACTION.

CASCADE: delete child data, then delete parent data
(subject to other
relationships, as noted above)

SET NULL: set foreign key field to NULL, then
update/delete parent.

SET DEFAULT: set foreign key field to DEFAULT value, then
update/delete parent.

SET NULL and SET DEFAULT can be useful options
(especially for ON
DELETE). For example, we might have a table, Offices,
that kept track
of which office a worker used. It might contain fields
for officeid,
officelocation, and persid. If we wanted to delete a
person, we
shouldn't be stopped just because that person has a
related office,
but similarly, we don't want to delete an office just
because this
person is being deleted. In some cases, the best option
might be to
set the persid field for Offices to NULL (or DEFAULT),
leaving the
office in place, but making it clear that this office is
now unused.

Note that ON UPDATE and ON DELETE can have different
rules. It's very
common, for instance to ON UPDATE CASCADE but ON DELETE
RESTRICT--allow people to change their person IDs, but
not allow
deleting of Tasks if a related person exists.

It is possible to change the actions for a relationship,
but requires
a little hacking in the system catalog tables. See
Hacking the
Relationships, below.

Deferring

By default, referential integrity is checked for every
single
relationship, for every single insert, delete, update
that could
affect this relationship. This means that

INSERT INTO Tasks (pid, task) VALUES (5, 'Open sales
office');
INSERT INTO Person (pid, name) VALUES (5, 'Helen Kim');

would fail, because at the time of the first attempted
insert, there
is no person with pid=5. Much of the time, this is the
most intuitive
setting.

Sometimes, however, you may not be able to predict the
exact order
data arrives. Perhaps you are receiving data loaded from
a text file,
or across the web. It's possible that the data may
arrive in the order
above (task, then associated person). When this happens,
you can
choose to defer the transaction checking.

In order to defer a transaction, two things must happen:

1) the referential integrity relationship must have been
defined as
DEFERRABLE. This is not the default (NOT DEFERRABLE is),
so it must be
declared explicity, as we did for Appts, above.

2) you must be in an explicit transaction.

3) you must either have DEFERRED be the initial default
for this
relationship, or have SET the relationship constraint to
DEFERRED for
this transaction.

For our Appts table relationship to Pers, we have
declared this is as
deferred. So, if we do

BEGIN;
SET CONSTRAINTS ALL DEFERRABLE;
INSERT INTO Appts (pid, apptsubj, apptdate)
VALUES (5, 'Negotiation', '2001-01-20');
INSERT INTO Pers VALUES (5, 'Helen Kim');
COMMIT;

works just fine. Note that having the relationship
declared as
DEFERRABLE is not enough--we must also use SET to
explicity set
CONSTRAINTS to deferred. In this example, we set all
relationships to
DEFERRED; instead we could set only a single constraint
to deferred,
as in

SET CONSTRAINTS appts__ref_pers DEFERRABLE;

It may be convenient to have a relationship already set,
rather than
having to set this for every transaction. To do this,
add "INITIALLY
DEFERRED" to the CONSTRAINT ... REFERENCES declaration
for the table,
eg

# CREATE TABLE Appts (apptid SERIAL NOT NULL PRIMARY KEY,
pid INT NOT NULL
CONSTRAINT appt__ref_pers
REFERENCES Pers
ON UPDATE CASCADE
ON DELETE CASCADE
DEFERRABLE
INITALLY DEFERRABLE,
apptsubj TEXT NOT NULL,
apptdate DATE NOT NULL);

Now, we could simply

BEGIN;
INSERT INTO Appts (pid, apptsubj, apptdate)
VALUES (5, 'Negotiation', '2001-01-20');
INSERT INTO Pers VALUES (5, 'Helen Kim');
COMMIT;

With no explicit SET command. This is especially
convenient if you
work in a programming setting that abstracts SQL
commands and makes it
difficult to execute an arbitrary, nonstandard SQL
standard such as
"SET...".

[Ed: I assume that deferred could be a bit faster for
large inserts,
too, since the subroutine would be exited out of quickly
until the
very last check. Is there enough time difference/can
anyone verify
this?]

Hacking Referential Integrity

Referential integrity works great in PostgreSQL.
However, PostgreSQL
does not (yet) have SQL-synax commands to change
actions, turn on/off
RI, etc. However, as this information is stored in the
system catalog
tables, it can be performed by editing these tables
directly.

WARNING: to edit the system catalog tables, you must be
a superuser in
PostgreSQL. In addition, you should be VERY CAREFULLY
when editing
these tables, and make sure that you have a backup first
(via
pg_dump). An accidental table-wide UPDATE or DELETE
could delete all
of your tables, ruins your indexes, corrupt your
database,
etc. Practice this by working in a test database,
preferrably even on
a machine without any other critical databases.

There are several system catalog tables of interest to
us:

pg_class all "classses", included tables, views,
sequences, etc.
pg_trigger all triggers. PostgreSQL handles
referential integrity
using behind-the-scenes triggers, so this is where all
of your RI
controls are stored.
pg_proc all PostgreSQL procedures. We won't need to
make any
changes to this, but will use this to show what the RI
procedures used
actually are.

To see (most) of the system tables, you can use the
command \dS in
psql.

For example, for our test tables, let's collect
information from
pg_class about our tables.

# SELECT oid, relname FROM pg_class WHERE relname IN
('pers','appts','tasks');

Note that PostgreSQL usually downcases non-quoted system
identifiers
like tables for us automatically (that is, I can CREATE
TABLE foo but
SELECT * FROM FOO.) However, when examining pg_class,
you must work in
a case-sensitive manner, or use case-insensitive
operators.

This query returns

oid | relname
--------+---------
9100 | pers
9110 | tasks
9120 | appts
(3 rows)

(Your OIDs will be different. Don't worry, just notice
what they are.)

Now, if we look in pg_trigger, we can find the triggers
that are used
by our tables.

test2=# SELECT * from pg_trigger WHERE tgrelid in
(9100, 9110, 9120);

tgrelid | tgname | tgfoid | tgtype
| tgenabled |
tgisconstraint | tgconstrname | tgconstrrelid |
tgdeferrable |
tginitdeferred | tgnargs | tgattr |
tgargs
---------+-----------------------------+--------+--------
+-----------+----------------+----------------+----------
-----+--------------+----------------+---------+--------
+--------------------------------------------------------
----------
263721 | RI_ConstraintTrigger_263752 | 1644 | 21
| t |
t | tasks__ref_p | 263674 | f
| f | 6 | |
tasks__ref_p\000tasks\000pers\000UNSPECIFIED\000pid\000pi
d\000
263674 | RI_ConstraintTrigger_263754 | 1654 | 9
| t |
t | tasks__ref_p | 263721 | f
| f | 6 | |
tasks__ref_p\000tasks\000pers\000UNSPECIFIED\000pid\000pi
d\000
263674 | RI_ConstraintTrigger_263756 | 1655 | 17
| t |
t | tasks__ref_p | 263721 | f
| f | 6 | |
tasks__ref_p\000tasks\000pers\000UNSPECIFIED\000pid\000pi
d\000
263776 | RI_ConstraintTrigger_263808 | 1644 | 21
| t |
t | appt__ref_pers | 263674 | t
| f | 6 | |
appt__ref_pers\000appts\000pers\000UNSPECIFIED\000pid\000
pid\000
263674 | RI_ConstraintTrigger_263810 | 1646 | 9
| t |
t | appt__ref_pers | 263776 | t
| f | 6 | |
appt__ref_pers\000appts\000pers\000UNSPECIFIED\000pid\000
pid\000
263674 | RI_ConstraintTrigger_263812 | 1647 | 17
| t |
t | appt__ref_pers | 263776 | t
| f | 6 | |
appt__ref_pers\000appts\000pers\000UNSPECIFIED\000pid\000
pid\000
(6 rows)

The columns in pg_trigger are:

tgrelid What table is this trigger on. That is,
which table is
inserted/updated/deleted that calls that this trigger.
tgname Name of this trigger. Trigger names are
generated by the referential integrity feature of
PostgreSQL are are
unimportant, except that they are unique.
tgfoid OID of the function that is called. This is
an important
column--by changing the function called, we can change
which action
(CASCADE, etc.) is performed. (See below).
tgtype What kind of trigger is this (is it UPDATE,
DELETE, etc.)
tgenabled Is this trigger enabled?
tgisconstraint Is this trigger part of a constraint.
Non-RI
triggers that are user-declared by CREATE TRIGGER may be
FALSE for
this, but RI triggers will always be true.
tgconstrname Name of constraint that calls this trigger.
If you named
your constraints (as we did, eg pers__ref_tasks), this
will be the
constraint name, otherwise it will be unnamed.
tgconstrrelid OID of table that which had the constraint
set.
tgdeferrable Can this constraint be deferred?
(Equivalent to
DEFERRABLE. See above.)
tginitdeferred Is this constraint initially deferred?
(Equivalent to
INITIALLY DEFERRED. See above.)
tgnargs Number of arguments for the RI function. As
of PostgreSQL
7.1, this always seems to be 6, and should not be edited.
tgattr ???
tgargs The actual arguments to the RI function.

To help make this more understandable and usable, a view
is helpful:

CREATE VIEW dev_ri
AS
SELECT t.oid as trigoid,
c.relname as trig_tbl,
t.tgfoid,
f.proname as trigfunc,
t.tgenabled,
t.tgconstrname,
c2.relname as const_tbl,
t.tgdeferrable,
t.tginitdeferred
FROM pg_trigger t,
pg_class c,
pg_class c2,
pg_proc f
WHERE t.tgrelid=c.oid
AND t.tgconstrrelid=c2.oid
AND tgfoid=f.oid
AND tgname ~ '^RI_'
ORDER BY t.oid;

(This view requires PostgreSQL 7.1 because of the ORDER
BY. For
versions earlier than 7.1, you can remove the ORDER BY.)

# select * from dev_ri;

trigoid | trig_tbl | tgfoid | trigfunc |
tgenabled |
tgconstrname | const_tbl | tgdeferrable | tginitdeferred
---------+----------+--------+----------------------+----
-------+----------------+-----------+--------------+-----
-----------
263753 | tasks | 1644 | RI_FKey_check_ins | t
|
tasks__ref_p | pers | f | f
263755 | pers | 1654 | RI_FKey_noaction_del | t
|
tasks__ref_p | tasks | f | f
263757 | pers | 1655 | RI_FKey_noaction_upd | t
|
tasks__ref_p | tasks | f | f
263809 | appts | 1644 | RI_FKey_check_ins | t
|
appt__ref_pers | pers | t | f
263811 | pers | 1646 | RI_FKey_cascade_del | t
|
appt__ref_pers | appts | t | f
263813 | pers | 1647 | RI_FKey_cascade_upd | t
|
appt__ref_pers | appts | t | f
(6 rows)

Now, it's much easier to understand what's happening.
For example,
from trigger oid=263753, we can see that on inserts to
the Tasks
table, RI_FKey_check_ins is called, which checks the
Pers table.

From here, we can:

1) Temporarily disable a trigger. If you want to load
lots of data,
and not slow down with referential integity checks (and
without having
to deal w/transaction-level deferrment):

UPDATE pg_trigger SET tgenabled=FALSE WHERE oid=xxx;

(get the trigoid column from our view to find which
trigger you want
to affect.)

[pg_dump files do this so that table data can be
inserted in any
order, without having to worry about RI rules.]

2) Make a trigger DEFERRABLE, if it orginally wasn't, or
vice-versa:

UPDATE pg_trigger SET tgdeferrable=[ TRUE | FALSE ]
WHERE oid=xxx;

Or, to make a trigger INITIALLY DEFERRED (or turn this
off):

UPDATE pg_trigger SET tginitdeferred=[ TRUE | FALSE ]
WHERE oid=xxx;

3) Change the action for a trigger.

If you've created a trigger with an action (or with the
default NO
ACTION action), you can change your mind by changing the
function
called. To get the list of all RI trigger functions:

SELECT oid, proname FROM pg_proc where proname ~ '^RI_';
oid | proname
------+------------------------
1646 | RI_FKey_cascade_del
1647 | RI_FKey_cascade_upd
1644 | RI_FKey_check_ins
1645 | RI_FKey_check_upd
1654 | RI_FKey_noaction_del
1655 | RI_FKey_noaction_upd
1648 | RI_FKey_restrict_del
1649 | RI_FKey_restrict_upd
1652 | RI_FKey_setdefault_del
1653 | RI_FKey_setdefault_upd
1650 | RI_FKey_setnull_del
1651 | RI_FKey_setnull_upd

(Your OIDs will probably be different. Note and use your
own.)

In our example, to set updates on Pers(pid) to CASCADE,
rather than NO
ACTION on Pers-Tasks,

UPDATE pg_trigger SET tgfoid=1647 WHERE oid=xxx;

(where xxx is our current noaction_upd trigger for Pers-
Tasks)

Notice that these changes often require a new backend.
Quit and
restart psql, or reset your client connection, and you
should be able
to test out your new settings.

--
Joel Burton, Director of Information Systems -*- jburton(at)scw(dot)org
Support Center of Washington (www.scw.org)

Browse pgsql-docs by date

  From Date Subject
Next Message Joel Burton 2000-12-04 16:04:37 Patches for docs
Previous Message Joel Burton 2000-12-02 23:04:33 Referential Integrity / Hacking RI Tutorial