EXISTS optimization

From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <pgsql-hackers(at)postgresql(dot)org>,<pgsql-performance(at)postgresql(dot)org>
Cc: "Andrea Olson" <Andrea(dot)Olson(at)wicourts(dot)gov>, "Bill Severson" <Bill(dot)Severson(at)wicourts(dot)gov>, "John Hutchins" <John(dot)Hutchins(at)wicourts(dot)gov>, "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>, "Randy Peterson" <Randy(dot)Peterson(at)wicourts(dot)gov>, "Shannon Spranger" <Shannon(dot)Spranger(at)wicourts(dot)gov>
Subject: EXISTS optimization
Date: 2007-03-23 17:01:40
Message-ID: 4603C1A3.EE98.0025.0@wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-performance

I'm posting this to performance in case our workaround may be of benefit to someone with a similar issue. I'm posting to hackers because I hope we can improve our planner in this area so that a workaround is not necessary. (It might make sense to reply to one group or the other, depending on reply content.)

We are converting from a commercial database (which shall remain unnamed here, due to license restrictions on publishing benchmarks). Most queries run faster on PostgreSQL; a small number choose very poor plans and run much longer. This particular query runs on the commercial product in 6.1s first time, 1.4s cached. In PostgreSQL it runs in about 144s both first time and cached. I was able to use an easy but fairly ugly rewrite (getting duplicate rows and eliminating them with DISTINCT) which runs on the commercial product in 9.2s/3.0s and in PostgreSQL in 2.0s/0.7s.

Here are the tables:

Table "public.TranHeader"
Column | Type | Modifiers
---------------+------------------+-----------
tranNo | "TranNoT" | not null
countyNo | "CountyNoT" | not null
acctPd | "DateT" | not null
date | "DateT" | not null
isComplete | boolean | not null
tranId | "TranIdT" | not null
tranType | "TranTypeT" | not null
userId | "UserIdT" | not null
workstationId | "WorkstationIdT" | not null
time | "TimeT" |
Indexes:
"TranHeader_pkey" PRIMARY KEY, btree ("tranNo", "countyNo")
"TranHeader_TranAcctPeriod" UNIQUE, btree ("acctPd", "tranNo", "countyNo")
"TranHeader_TranDate" UNIQUE, btree (date, "tranNo", "countyNo")

Table "public.TranDetail"
Column | Type | Modifiers
-----------------+--------------------+-----------
tranNo | "TranNoT" | not null
tranDetailSeqNo | "TranDetailSeqNoT" | not null
countyNo | "CountyNoT" | not null
acctCode | "AcctCodeT" | not null
amt | "MoneyT" | not null
assessNo | "TranIdT" |
caseNo | "CaseNoT" |
citnNo | "CitnNoT" |
citnViolDate | "DateT" |
issAgencyNo | "IssAgencyNoT" |
partyNo | "PartyNoT" |
payableNo | "PayableNoT" |
rcvblNo | "RcvblNoT" |
Indexes:
"TranDetail_pkey" PRIMARY KEY, btree ("tranNo", "tranDetailSeqNo", "countyNo")
"TranDetail_TranDetCaseNo" UNIQUE, btree ("caseNo", "tranNo", "tranDetailSeqNo", "countyNo")
"TranDetail_TranDetPay" UNIQUE, btree ("payableNo", "tranNo", "tranDetailSeqNo", "countyNo")
"TranDetail_TranDetRcvbl" UNIQUE, btree ("rcvblNo", "tranNo", "tranDetailSeqNo", "countyNo")
"TranDetail_TranDetAcct" btree ("acctCode", "citnNo", "countyNo")

Table "public.Adjustment"
Column | Type | Modifiers
-----------------+-----------------------+-----------
adjustmentNo | "TranIdT" | not null
countyNo | "CountyNoT" | not null
date | "DateT" | not null
isTranVoided | boolean | not null
reasonCode | "ReasonCodeT" | not null
tranNo | "TranNoT" | not null
adjustsTranId | "TranIdT" |
adjustsTranNo | "TranNoT" |
adjustsTranType | "TranTypeT" |
explanation | character varying(50) |
Indexes:
"Adjustment_pkey" PRIMARY KEY, btree ("adjustmentNo", "countyNo")
"Adjustment_AdjustsTranId" btree ("adjustsTranId", "adjustsTranType", "tranNo", "countyNo")
"Adjustment_AdjustsTranNo" btree ("adjustsTranNo", "tranNo", "countyNo")
"Adjustment_Date" btree (date, "countyNo")

Admittedly, the indexes are optimized for our query load under the commercial product, which can use the "covering index" optimization.

explain analyze
SELECT "A"."adjustmentNo", "A"."tranNo", "A"."countyNo", "H"."date", "H"."userId", "H"."time"
FROM "Adjustment" "A"
JOIN "TranHeader" "H" ON ("H"."tranId" = "A"."adjustmentNo" AND "H"."countyNo" = "A"."countyNo" AND "H"."tranNo" = "A"."tranNo")
WHERE "H"."tranType" = 'A'
AND "A"."date" > DATE '2006-01-01'
AND "H"."countyNo" = 66
AND "A"."countyNo" = 66
AND EXISTS
(
SELECT 1 FROM "TranDetail" "D"
WHERE "D"."tranNo" = "H"."tranNo"
AND "D"."countyNo" = "H"."countyNo"
AND "D"."caseNo" LIKE '2006TR%'
)
;

Nested Loop (cost=182.56..72736.37 rows=1 width=46) (actual time=6398.108..143631.427 rows=2205 loops=1)
Join Filter: (("H"."tranId")::bpchar = ("A"."adjustmentNo")::bpchar)
-> Bitmap Heap Scan on "Adjustment" "A" (cost=182.56..1535.69 rows=11542 width=22) (actual time=38.098..68.324 rows=12958 loops=1)
Recheck Cond: (((date)::date > '2006-01-01'::date) AND (("countyNo")::smallint = 66))
-> Bitmap Index Scan on "Adjustment_Date" (cost=0.00..179.67 rows=11542 width=0) (actual time=32.958..32.958 rows=12958 loops=1)
Index Cond: (((date)::date > '2006-01-01'::date) AND (("countyNo")::smallint = 66))
-> Index Scan using "TranHeader_pkey" on "TranHeader" "H" (cost=0.00..6.15 rows=1 width=46) (actual time=11.073..11.074 rows=0 loops=12958)
Index Cond: ((("H"."tranNo")::integer = ("A"."tranNo")::integer) AND (("H"."countyNo")::smallint = 66))
Filter: ((("tranType")::bpchar = 'A'::bpchar) AND (subplan))
SubPlan
-> Index Scan using "TranDetail_TranDetCaseNo" on "TranDetail" "D" (cost=0.00..4.73 rows=1 width=0) (actual time=11.038..11.038 rows=0 loops=12958)
Index Cond: ((("caseNo")::bpchar >= '2006TR'::bpchar) AND (("caseNo")::bpchar < '2006TS'::bpchar) AND (("tranNo")::integer = ($0)::integer) AND (("countyNo")::smallint = ($1)::smallint))
Filter: (("caseNo")::bpchar ~~ '2006TR%'::text)
Total runtime: 143633.838 ms

The commercial product scans the index on caseNo in TranDetail to build a work table of unique values, then uses indexed access to the TranHeader and then to Adjustment. I was able to get approximately the same plan (except the duplicates are eliminated at the end) in PostgreSQL by rewriting to this:

SELECT DISTINCT "A"."adjustmentNo", "A"."tranNo", "A"."countyNo", "H"."date", "H"."userId", "H"."time"
FROM "Adjustment" "A"
JOIN "TranHeader" "H" ON ("H"."tranId" = "A"."adjustmentNo" AND "H"."countyNo" = "A"."countyNo" AND "H"."tranNo" = "A"."tranNo")
JOIN "TranDetail" "D" ON ("D"."tranNo" = "H"."tranNo" AND "D"."countyNo" = "H"."countyNo" AND "D"."caseNo" LIKE '2006TR%')
WHERE "H"."tranType" = 'A'
AND "A"."date" > DATE '2006-01-01'
AND "H"."countyNo" = 66
AND "A"."countyNo" = 66
;

Unique (cost=130.96..130.98 rows=1 width=46) (actual time=694.591..715.008 rows=2205 loops=1)
-> Sort (cost=130.96..130.96 rows=1 width=46) (actual time=694.586..701.808 rows=16989 loops=1)
Sort Key: "A"."adjustmentNo", "A"."tranNo", "A"."countyNo", "H".date, "H"."userId", "H"."time"
-> Nested Loop (cost=0.00..130.95 rows=1 width=46) (actual time=0.157..636.779 rows=16989 loops=1)
Join Filter: (("H"."tranNo")::integer = ("A"."tranNo")::integer)
-> Nested Loop (cost=0.00..113.76 rows=4 width=50) (actual time=0.131..452.544 rows=16989 loops=1)
-> Index Scan using "TranDetail_TranDetCaseNo" on "TranDetail" "D" (cost=0.00..27.57 rows=20 width=6) (actual time=0.049..83.005 rows=46293 loops=1)
Index Cond: ((("caseNo")::bpchar >= '2006TR'::bpchar) AND (("caseNo")::bpchar < '2006TS'::bpchar) AND (66 = ("countyNo")::smallint))
Filter: (("caseNo")::bpchar ~~ '2006TR%'::text)
-> Index Scan using "TranHeader_pkey" on "TranHeader" "H" (cost=0.00..4.30 rows=1 width=46) (actual time=0.006..0.007 rows=0 loops=46293)
Index Cond: ((("D"."tranNo")::integer = ("H"."tranNo")::integer) AND (("H"."countyNo")::smallint = 66))
Filter: (("tranType")::bpchar = 'A'::bpchar)
-> Index Scan using "Adjustment_pkey" on "Adjustment" "A" (cost=0.00..4.28 rows=1 width=22) (actual time=0.007..0.008 rows=1 loops=16989)
Index Cond: ((("H"."tranId")::bpchar = ("A"."adjustmentNo")::bpchar) AND (("A"."countyNo")::smallint = 66))
Filter: ((date)::date > '2006-01-01'::date)
Total runtime: 715.932 ms

I can't see any reason that PostgreSQL can't catch up to the other product on this optimization issue. This usage of DISTINCT seems a bit sloppy; I usually try to dissuade the application programmers from accumulating duplicates during the joins and then eliminating them in this way.

-Kevin

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Hugh Sasse 2007-03-23 17:15:17 Re: Documentation access problems.
Previous Message Joshua D. Drake 2007-03-23 16:59:11 Re: Documentation access problems.

Browse pgsql-performance by date

  From Date Subject
Next Message Michael Stone 2007-03-23 17:13:02 Re: Parallel Vacuum
Previous Message Tino Wildenhain 2007-03-23 16:54:05 Re: Performance of count(*)