Referential Integrity Tutorial & Hacking Referential Integrity: Referential integrity is the feature of a database that ensures that implied relationships in a database are enforced. It is a feature of most database systems, and protects users from accidentally (or intentially!) creating discrepencies in their database. In this chapter, we will work with a sample set of data involving people, tasks they are responsible for, and appointments you have scheduled with them. As you fire employees, you delete them from your person table, but want to be certain that you can't leave tasks that no one is in charge of, or leave appointments where the persons name cannot be located. (If your not familiar with the concepts of Primary Keys and Foreign Keys, it might be helpful to review a basic database primer, as this tutorial assumes basic knowledge of these concepts.) First, because we're going playing with the system tables, it's best to do this in a practice database: # CREATE DATABASE RI_test; # \c RI_test Now, let's create practice tables. We'll have one parent table, p, and two child tables, c1 and c2. The parent table is straightforward: # CREATE TABLE Pers (pid INT NOT NULL PRIMARY KEY, pname TEXT NOT NULL); The first child table, which will hold tasks for which this person is responsible. # CREATE TABLE Tasks (taskid SERIAL NOT NULL PRIMARY KEY, pid INT NOT NULL CONSTRAINT tasks__ref_p REFERENCES Pers, task TEXT NOT NULL); A few notes: * foreign key references are handled by PostgreSQL as a type of CONSTRAINT. Therefore, we can begin the referential declaration with CONSTRAINT . If this is done, the error messages generated by PostgreSQL will return this error name (which the user could helpfully see, or which we could parse and handle in a front-end system.) By wrapping it in system identifiers ("), we could even make it look like an error message (eg CONSTRAINT "Table tasks refers to table Pers"), but we are limited in the length of this name/message, and therefore it's usefulness. CONSTRAINT names are not neccessary, and we could leave this off: ...pid INT NOT NULL REFERENCES pers..., in which case everything works the same, except our constraint is unnamed, and therefore the error messages are more generic. * after REFERENCES comes the name of the table we are referring to, and (optionally), the name of the field in parentheses. If the field is the primary key, the fieldname is optional. If we had not made pid table Pers's primary key, we would have to say REFERENCES Pers(pid). * optionally, we can inform PostgreSQL how to handle deletes/updates on table Pers by adding ON DELETE and ON UPDATE declarations. If these are not made specific, then PostgreSQL defaults to "no action", which (at the time of this writing, 12/2000) means the same thing as "restrict" [Ed: my digging into the source code shows that they're treated the same, though there is a comment about SQL3 requiring that it be trapped before... I'm not sure what this is referring to, exactly, but it still seems true that the mean the same. Is this accurate?] , which means that we could neither delete a parent nor update the parent pid field if a child existed that depended on that parent. In our other child table, we'll see another way to handle this. * optionally, we could tell PostgreSQL if this is "deferrable", that is, if we are allowed to violate referential integrity while inside of some transaction as long as we have fixed our integrity problem at the end of the transaction. We'll see examples of this below, in DEFERRING. So, let's add some sample data: INSERT INTO Pers VALUES (1, 'Jeff Brown'); INSERT INTO Pers VALUES (2, 'Maria Lane'); INSERT INTO Tasks (pid, task) VALUES (1, 'Write contract'); INSERT INTO Tasks (pid, task) VALUES (1, 'Upgrade database'); So far so good. If we try to insert a child that has no parent, eg INSERT INTO Tasks (pid, task) VALUES (3, 'Install Linux'); we'll get a referential integrity error. This error will refer to our constraint name (if any), and will block the entering of this data into Tasks. (In addition, if this is part of a transaction, it will rollback the entire transaction, as always happens in PostgreSQL). Also, if we try to change a parent that has children: DELETE FROM Pers WHERE pname = 'Jeff Brown'; We'll get the same problem, as we cannot delete Jeff as long as he has his two tasks assigned. DELETE FROM Pers WHERE pname = 'Maria Lane'; (if you actually deleted Maria, add her back in for our later examples.) because there are currently no child tasks for Maria. Default (NO ACTION) (or RESTRICT, which means the same thing to PostgreSQL) ON UPDATE and ON DELETE rules make sense for many situations like this. You wouldn't want to be able to delete a staff person from your database if they had certain resposibilities, otherwise you would never know who was in charge of different tasks. A More Permissive Child Table: Appointments Let's create a second child, Appts (for Appointments): # CREATE TABLE Appts (apptid SERIAL NOT NULL PRIMARY KEY, pid INT NOT NULL CONSTRAINT appt__ref_pers REFERENCES Pers ON UPDATE CASCADE ON DELETE CASCADE DEFERRABLE, apptsubj TEXT NOT NULL, apptdate DATE NOT NULL); and insert some data into it: INSERT INTO Appts (pid, apptsubj, apptdate) VALUES (1, 'Discuss raise', '2001-01-15'); INSERT INTO Appts (pid, apptsubj, apptdate) VALUES (2, 'Plan project', '2001-01-20'); This works just fine. Also, just as before, INSERT INTO Appts (pid, apptsubj, apptdate) VALUES (3, 'Termination', '2001-01-20'); does not work, as there is no Pers with pid=3. However, unlike the Tasks table, Appts is set to CASCADE deletes and updates. This means that if we update a person's pid, or delete a person entirely, PostgreSQL will allow this by first deleting any Tasks that exist for this person. So: UPDATE Pers SET pid=14 WHERE pid=2; This will change Maria's pid to 14, first changing the connecting pid in the Appts table. DELETE FROM Pers WHERE pid=14; This will delete Maria's appointment, then delete Maria. Other Relationships: Note that CASCADE in this class only refers to the CASCADE between the Pers and Appt tables. Trying to DELETE Jeff would still fail because, although the Pers-Appt relationship would CASCADE, the Pers-Tasks relationship would fail, and PostgreSQL would report this DELETE attempt as a violation of that referential integrity. Possible actions: NO ACTION: (the default). Stop the action if an update/delete would fail referential integrity checking. RESTRICT: currently means same thing as NO ACTION. CASCADE: delete child data, then delete parent data (subject to other relationships, as noted above) SET NULL: set foreign key field to NULL, then update/delete parent. SET DEFAULT: set foreign key field to DEFAULT value, then update/delete parent. SET NULL and SET DEFAULT can be useful options (especially for ON DELETE). For example, we might have a table, Offices, that kept track of which office a worker used. It might contain fields for officeid, officelocation, and persid. If we wanted to delete a person, we shouldn't be stopped just because that person has a related office, but similarly, we don't want to delete an office just because this person is being deleted. In some cases, the best option might be to set the persid field for Offices to NULL (or DEFAULT), leaving the office in place, but making it clear that this office is now unused. Note that ON UPDATE and ON DELETE can have different rules. It's very common, for instance to ON UPDATE CASCADE but ON DELETE RESTRICT--allow people to change their person IDs, but not allow deleting of Tasks if a related person exists. It is possible to change the actions for a relationship, but requires a little hacking in the system catalog tables. See Hacking the Relationships, below. Deferring By default, referential integrity is checked for every single relationship, for every single insert, delete, update that could affect this relationship. This means that INSERT INTO Tasks (pid, task) VALUES (5, 'Open sales office'); INSERT INTO Person (pid, name) VALUES (5, 'Helen Kim'); would fail, because at the time of the first attempted insert, there is no person with pid=5. Much of the time, this is the most intuitive setting. Sometimes, however, you may not be able to predict the exact order data arrives. Perhaps you are receiving data loaded from a text file, or across the web. It's possible that the data may arrive in the order above (task, then associated person). When this happens, you can choose to defer the transaction checking. In order to defer a transaction, two things must happen: 1) the referential integrity relationship must have been defined as DEFERRABLE. This is not the default (NOT DEFERRABLE is), so it must be declared explicity, as we did for Appts, above. 2) you must be in an explicit transaction. 3) you must either have DEFERRED be the initial default for this relationship, or have SET the relationship constraint to DEFERRED for this transaction. For our Appts table relationship to Pers, we have declared this is as deferred. So, if we do BEGIN; SET CONSTRAINTS ALL DEFERRABLE; INSERT INTO Appts (pid, apptsubj, apptdate) VALUES (5, 'Negotiation', '2001-01-20'); INSERT INTO Pers VALUES (5, 'Helen Kim'); COMMIT; works just fine. Note that having the relationship declared as DEFERRABLE is not enough--we must also use SET to explicity set CONSTRAINTS to deferred. In this example, we set all relationships to DEFERRED; instead we could set only a single constraint to deferred, as in SET CONSTRAINTS appts__ref_pers DEFERRABLE; It may be convenient to have a relationship already set, rather than having to set this for every transaction. To do this, add "INITIALLY DEFERRED" to the CONSTRAINT ... REFERENCES declaration for the table, eg # CREATE TABLE Appts (apptid SERIAL NOT NULL PRIMARY KEY, pid INT NOT NULL CONSTRAINT appt__ref_pers REFERENCES Pers ON UPDATE CASCADE ON DELETE CASCADE DEFERRABLE INITALLY DEFERRABLE, apptsubj TEXT NOT NULL, apptdate DATE NOT NULL); Now, we could simply BEGIN; INSERT INTO Appts (pid, apptsubj, apptdate) VALUES (5, 'Negotiation', '2001-01-20'); INSERT INTO Pers VALUES (5, 'Helen Kim'); COMMIT; With no explicit SET command. This is especially convenient if you work in a programming setting that abstracts SQL commands and makes it difficult to execute an arbitrary, nonstandard SQL standard such as "SET...". [Ed: I assume that deferred could be a bit faster for large inserts, too, since the subroutine would be exited out of quickly until the very last check. Is there enough time difference/can anyone verify this?] Hacking Referential Integrity Referential integrity works great in PostgreSQL. However, PostgreSQL does not (yet) have SQL-synax commands to change actions, turn on/off RI, etc. However, as this information is stored in the system catalog tables, it can be performed by editing these tables directly. WARNING: to edit the system catalog tables, you must be a superuser in PostgreSQL. In addition, you should be VERY CAREFULLY when editing these tables, and make sure that you have a backup first (via pg_dump). An accidental table-wide UPDATE or DELETE could delete all of your tables, ruins your indexes, corrupt your database, etc. Practice this by working in a test database, preferrably even on a machine without any other critical databases. There are several system catalog tables of interest to us: pg_class all "classses", included tables, views, sequences, etc. pg_trigger all triggers. PostgreSQL handles referential integrity using behind-the-scenes triggers, so this is where all of your RI controls are stored. pg_proc all PostgreSQL procedures. We won't need to make any changes to this, but will use this to show what the RI procedures used actually are. To see (most) of the system tables, you can use the command \dS in psql. For example, for our test tables, let's collect information from pg_class about our tables. # SELECT oid, relname FROM pg_class WHERE relname IN ('pers','appts','tasks'); Note that PostgreSQL usually downcases non-quoted system identifiers like tables for us automatically (that is, I can CREATE TABLE foo but SELECT * FROM FOO.) However, when examining pg_class, you must work in a case-sensitive manner, or use case-insensitive operators. This query returns oid | relname --------+--------- 9100 | pers 9110 | tasks 9120 | appts (3 rows) (Your OIDs will be different. Don't worry, just notice what they are.) Now, if we look in pg_trigger, we can find the triggers that are used by our tables. test2=# SELECT * from pg_trigger WHERE tgrelid in (9100, 9110, 9120); tgrelid | tgname | tgfoid | tgtype | tgenabled | tgisconstraint | tgconstrname | tgconstrrelid | tgdeferrable | tginitdeferred | tgnargs | tgattr | tgargs ---------+-----------------------------+--------+--------+-----------+----------------+----------------+---------------+--------------+----------------+---------+--------+------------------------------------------------------------------ 263721 | RI_ConstraintTrigger_263752 | 1644 | 21 | t | t | tasks__ref_p | 263674 | f | f | 6 | | tasks__ref_p\000tasks\000pers\000UNSPECIFIED\000pid\000pid\000 263674 | RI_ConstraintTrigger_263754 | 1654 | 9 | t | t | tasks__ref_p | 263721 | f | f | 6 | | tasks__ref_p\000tasks\000pers\000UNSPECIFIED\000pid\000pid\000 263674 | RI_ConstraintTrigger_263756 | 1655 | 17 | t | t | tasks__ref_p | 263721 | f | f | 6 | | tasks__ref_p\000tasks\000pers\000UNSPECIFIED\000pid\000pid\000 263776 | RI_ConstraintTrigger_263808 | 1644 | 21 | t | t | appt__ref_pers | 263674 | t | f | 6 | | appt__ref_pers\000appts\000pers\000UNSPECIFIED\000pid\000pid\000 263674 | RI_ConstraintTrigger_263810 | 1646 | 9 | t | t | appt__ref_pers | 263776 | t | f | 6 | | appt__ref_pers\000appts\000pers\000UNSPECIFIED\000pid\000pid\000 263674 | RI_ConstraintTrigger_263812 | 1647 | 17 | t | t | appt__ref_pers | 263776 | t | f | 6 | | appt__ref_pers\000appts\000pers\000UNSPECIFIED\000pid\000pid\000 (6 rows) The columns in pg_trigger are: tgrelid What table is this trigger on. That is, which table is inserted/updated/deleted that calls that this trigger. tgname Name of this trigger. Trigger names are generated by the referential integrity feature of PostgreSQL are are unimportant, except that they are unique. tgfoid OID of the function that is called. This is an important column--by changing the function called, we can change which action (CASCADE, etc.) is performed. (See below). tgtype What kind of trigger is this (is it UPDATE, DELETE, etc.) tgenabled Is this trigger enabled? tgisconstraint Is this trigger part of a constraint. Non-RI triggers that are user-declared by CREATE TRIGGER may be FALSE for this, but RI triggers will always be true. tgconstrname Name of constraint that calls this trigger. If you named your constraints (as we did, eg pers__ref_tasks), this will be the constraint name, otherwise it will be unnamed. tgconstrrelid OID of table that which had the constraint set. tgdeferrable Can this constraint be deferred? (Equivalent to DEFERRABLE. See above.) tginitdeferred Is this constraint initially deferred? (Equivalent to INITIALLY DEFERRED. See above.) tgnargs Number of arguments for the RI function. As of PostgreSQL 7.1, this always seems to be 6, and should not be edited. tgattr ??? tgargs The actual arguments to the RI function. To help make this more understandable and usable, a view is helpful: CREATE VIEW dev_ri AS SELECT t.oid as trigoid, c.relname as trig_tbl, t.tgfoid, f.proname as trigfunc, t.tgenabled, t.tgconstrname, c2.relname as const_tbl, t.tgdeferrable, t.tginitdeferred FROM pg_trigger t, pg_class c, pg_class c2, pg_proc f WHERE t.tgrelid=c.oid AND t.tgconstrrelid=c2.oid AND tgfoid=f.oid AND tgname ~ '^RI_' ORDER BY t.oid; (This view requires PostgreSQL 7.1 because of the ORDER BY. For versions earlier than 7.1, you can remove the ORDER BY.) # select * from dev_ri; trigoid | trig_tbl | tgfoid | trigfunc | tgenabled | tgconstrname | const_tbl | tgdeferrable | tginitdeferred ---------+----------+--------+----------------------+-----------+----------------+-----------+--------------+---------------- 263753 | tasks | 1644 | RI_FKey_check_ins | t | tasks__ref_p | pers | f | f 263755 | pers | 1654 | RI_FKey_noaction_del | t | tasks__ref_p | tasks | f | f 263757 | pers | 1655 | RI_FKey_noaction_upd | t | tasks__ref_p | tasks | f | f 263809 | appts | 1644 | RI_FKey_check_ins | t | appt__ref_pers | pers | t | f 263811 | pers | 1646 | RI_FKey_cascade_del | t | appt__ref_pers | appts | t | f 263813 | pers | 1647 | RI_FKey_cascade_upd | t | appt__ref_pers | appts | t | f (6 rows) Now, it's much easier to understand what's happening. For example, from trigger oid=263753, we can see that on inserts to the Tasks table, RI_FKey_check_ins is called, which checks the Pers table. From here, we can: 1) Temporarily disable a trigger. If you want to load lots of data, and not slow down with referential integity checks (and without having to deal w/transaction-level deferrment): UPDATE pg_trigger SET tgenabled=FALSE WHERE oid=xxx; (get the trigoid column from our view to find which trigger you want to affect.) [pg_dump files do this so that table data can be inserted in any order, without having to worry about RI rules.] 2) Make a trigger DEFERRABLE, if it orginally wasn't, or vice-versa: UPDATE pg_trigger SET tgdeferrable=[ TRUE | FALSE ] WHERE oid=xxx; Or, to make a trigger INITIALLY DEFERRED (or turn this off): UPDATE pg_trigger SET tginitdeferred=[ TRUE | FALSE ] WHERE oid=xxx; 3) Change the action for a trigger. If you've created a trigger with an action (or with the default NO ACTION action), you can change your mind by changing the function called. To get the list of all RI trigger functions: SELECT oid, proname FROM pg_proc where proname ~ '^RI_'; oid | proname ------+------------------------ 1646 | RI_FKey_cascade_del 1647 | RI_FKey_cascade_upd 1644 | RI_FKey_check_ins 1645 | RI_FKey_check_upd 1654 | RI_FKey_noaction_del 1655 | RI_FKey_noaction_upd 1648 | RI_FKey_restrict_del 1649 | RI_FKey_restrict_upd 1652 | RI_FKey_setdefault_del 1653 | RI_FKey_setdefault_upd 1650 | RI_FKey_setnull_del 1651 | RI_FKey_setnull_upd (Your OIDs will probably be different. Note and use your own.) In our example, to set updates on Pers(pid) to CASCADE, rather than NO ACTION on Pers-Tasks, UPDATE pg_trigger SET tgfoid=1647 WHERE oid=xxx; (where xxx is our current noaction_upd trigger for Pers-Tasks) Notice that these changes often require a new backend. Quit and restart psql, or reset your client connection, and you should be able to test out your new settings.