Re: UPDATE with subquery too slow

From: "Eric Jain" <Eric(dot)Jain(at)isb-sib(dot)ch>
To: "pgsql-performance" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: UPDATE with subquery too slow
Date: 2004-02-18 19:11:58
Message-ID: 013401c3f653$14f95970$c300000a@caliente
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

> I can't get the following statement to complete with reasonable time.

Upgraded to 7.4.1, and realized that NOT IN is far more efficient than
IN, EXISTS or NOT EXISTS, at least for the amount and distribution of
data that I have. Here are some numbers from before and after performing
the problematic clean up operation:

| Before | After
------------------------+-----------+-----------
COUNT(*) | 6'104'075 | 6'104'075
COUNT(session) | 5'945'272 | 3'640'659
COUNT(DISTINCT session) | 2'865'570 | 560'957

The following query completes within less than three hours on a machine
with a high load, versa many many hours for any of the alternatives:

UPDATE requests
SET session = NULL
WHERE session NOT IN
(
SELECT r.session
FROM requests r
WHERE r.session IS NOT NULL
GROUP BY r.session
HAVING COUNT(*) > 1
);

Note that in order to correctly reverse an IN subquery, IS NOT NULL
needs to be added.

Interestingly, the query planner believes that using EXISTS would be
more efficient than NOT IN, and IN only slightly less efficient; I
assume the query planner is not able to accurately estimate the number
of rows returned by the subquery.

EXISTS 351'511
NOT IN 376'577
IN 386'780
LEFT JOIN 18'263'826
NOT EXISTS 7'241'815'330

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Andrew Lazarus 2004-02-18 22:51:52 JOIN order, 15K, 15K, 7MM rows
Previous Message Tom Lane 2004-02-18 16:55:22 Re: Optimizer difference using function index between 7.3 and 7.4