activating RI constraints too slow

From: Preston Landers <planders(at)journyx(dot)com>
To: <pgsql-bugs(at)postgresql(dot)org>
Subject: activating RI constraints too slow
Date: 2003-08-29 21:08:40
Message-ID: Pine.LNX.4.33.0308281440140.6957-100000@dev2.int.journyx.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

============================================================================
POSTGRESQL BUG REPORT TEMPLATE
============================================================================

Your name : Preston Landers
Your email address : planders(at)journyx(dot)com

System Configuration
---------------------
Architecture (example: Intel Pentium) :
Intel Pentium II dual 500 mhz

Operating System (example: Linux 2.0.26 ELF) :
Linux 2.4.2-2smp (Redhat 7.1)

PostgreSQL version (example: PostgreSQL-7.3):
PostgreSQL-7.4beta2 snapshot (from 2003/8/26.)

Compiler used (example: gcc 2.95.2) :
GCC 2.96

Please enter a FULL description of your problem:
------------------------------------------------

This is not a "bug" report, just a performance complaint unaccompanied
by a patch. ;-)

Enabling referential integrity is painfully slow.

Enabling a foreign key constraint appears to take time proportional to
the number of parent and child records, but even longer than an
equivelent "check for bad records" query would.

With a good number of records (more than a few hundred thousand) this
can take a painfully long time. All of the time appears to be taken
in the query that checks for non-conforming records, not in creating
the rules. I *do* have indexes on all relevent columns. The columns
are the same data-type. It may be that I am simply doing something
else wrong but I sure can't identify it. I would appreciate any
suggestions.

Oracle and SQL Server, at least, seem to handle these checks in a much
more reasonable amount of time (less than 1/10th the time,) for
whatever that's worth.

We are currently using PG 7.2.4. I was hoping that upgrading to 7.4
would fix this, as I understand there are a number of performances
fixes. Unfortunately, this does not seem to be addressed yet. (I'm
testing using a snapshot from 2003/08/26.)

Please describe a way to repeat the problem. Please try to provide a
concise reproducible example, if at all possible:
----------------------------------------------------------------------

I'm pretty sure this has been brought up before on various mailing
lists. At least when I search around, I can find people posing
similar questions, but I can never seem to find a response or
solution.

Anyway, I can provide a sample database or SQL script that illustrates
the problem, if neccesary. The trick is to actually have hundreds of
thousands of records in there.

Out of all my tables, the FK constraint that takes longest to activate
is this one.

162.65 seconds for query:
ALTER TABLE "time_recs_attribs"
ADD CONSTRAINT f904_1_123
FOREIGN KEY ("id_time_rec") REFERENCES "time_recs" ("id_time_rec")
ON DELETE CASCADE;

time_recs (parent table) has 281,043 records in this particular
database. time_recs_attribs (child) has 246,890 records.

The column time_recs_attribs.id_time_rec (child) does have an
index. The parent table automatically has an index on the primary key
column (the one I am referencing.) I have tried doing a VACUUM FULL
just prior to activating the constraint to no avail.

In both tables the datatypes are VARCHAR(30).

With the same data and same schema under SQL Server, this ALTER TABLE
statement takes about 3 seconds on the same (actually, slightly worse)
hardware.

For another example, we have a constraint on the "time_recs" table
referencing the primary key of the "projects" table.

timesheet=# select count(id_project) from projects;
count
-------
462

timesheet=# select count(id_time_rec) from time_recs;
count
--------
101045

template1=# select count(distinct(id_project)) from time_recs;
count
-------
285

timesheet=# explain analyze
timesheet-# select id_time_rec from time_recs
timesheet-# where not exists ( select id_project from projects where projects.id_project = time_recs.id_project );
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on time_recs (cost=0.00..591579.89 rows=50523 width=33) (actual time=9278.47..9278.47 rows=0 loops=1)
Filter: (NOT (subplan))
SubPlan
-> Index Scan using projects_pkey on projects (cost=0.00..5.82 rows=1 width=33) (actual time=0.07..0.07 rows=1 loops=101045)
Index Cond: ((id_project)::text = ($0)::text)
Total runtime: 9278.70 msec
(6 rows)

Time: 9282.38 ms

template1=# ALTER TABLE "time_recs" ADD CONSTRAINT f123_5_108 FOREIGN KEY ("id_project") REFERENCES "projects" ("id_project");
NOTICE: ALTER TABLE will create implicit trigger(s) for FOREIGN KEY check(s)
ALTER TABLE
Time: 27137.22 ms

Why does the ALTER TABLE statement take three times as long as the
equivelent check query?

When I enable debug logging on my pgres server, I see this:

SELECT 1 FROM ONLY "public"."projects" x WHERE "id_project" = $1 FOR UPDATE OF x

It seems to be doing this query behind the scenes, iterating over each
record in time_recs. I guess that this might be the source of the
inefficiency. I have no idea how to address this in the postgresql
design -- just wanted to point it out in case it had been unnoticed.

Or could this just be purely a configuration issue? Most of the PG
config file settings in these databases are at their default values.

Please let me know if you need an actual sample database to better
illustrate this.

By the way, Microsoft SQL Server at least seems to use index scans for
both tables (parent and child), instead of a sequential scan over the
child table, in this case time_recs.

The equivelent "check query" in SQL Server also uses index scans for
both tables, using the "Hash Match / Right Anti Semi Join" method to
sort them together. (This is according to the Query Analyzer.)

If you know how this problem might be fixed, list the solution below:
---------------------------------------------------------------------

See above...

Browse pgsql-bugs by date

  From Date Subject
Next Message Bruce Momjian 2003-08-30 15:57:43 Re: bug report: pg_dump does not use CASCADE in DROP
Previous Message Tom Lane 2003-08-29 20:24:11 Re: about create type