wildcard text filter switched to boolean column, performance is way worse

From: Mike Broers <mbroers(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: wildcard text filter switched to boolean column, performance is way worse
Date: 2015-07-07 15:40:48
Message-ID: CAB9893ibPu+ABPfsxpL45GbcHZx4Heg4EndzuE=QPKHunNY68Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

I had a query that was filtering with a wildcard search of a text field for
%SUCCESS%. The query took about 5 seconds and was running often so I wanted
to improve it. I suggested that the engineers include a new boolean column
for successful status. They implemented the requested field, but the query
that filters on that new column runs very long (i kill it after letting it
run for about an hour). Can someone help me understand why that is the
case and how to resolve it?

First query:
SELECT *
FROM "lead"
WHERE ( NOT ( ( "lead"."id" IN
( SELECT U1."lead_id" AS "lead_id"
FROM "event" U1
WHERE U1."event_type" = 'type_1' )
OR ( "lead"."id" IN
( SELECT U1."lead_id" AS "lead_id"
FROM "event" U1
WHERE U1."event_type" = 'type_2' )
AND "lead"."id" IN
( SELECT U1."lead_id" AS "lead_id"
FROM "event" U1
WHERE UPPER(U1."response"::text) LIKE
UPPER('%success%') ) ) ) )
AND NOT ("lead"."ReferenceNumber" = '') ) ;

explain/analyze result:

Seq Scan on lead (cost=130951.81..158059.21 rows=139957 width=369) (actual
time=4699.619..4699.869 rows=1 loops=1)
Filter: ((NOT (hashed SubPlan 1)) AND (("ReferenceNumber")::text <>
''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (hashed SubPlan 3))))
Rows Removed by Filter: 375369

SubPlan 1

-> Seq Scan on event u1 (cost=0.00..42408.62 rows=7748 width=4)
(actual time=0.005..171.350 rows=7414 loops=1)
Filter: ((event_type)::text = 'type_1'::text)

Rows Removed by Filter: 1099436

SubPlan 2

-> Seq Scan on event u1_1 (cost=0.00..42408.62 rows=375665 width=4)
(actual time=0.006..219.092 rows=373298 loops=1)
Filter: ((event_type)::text = 'type_2'::text)

Rows Removed by Filter: 733552

SubPlan 3

-> Seq Scan on event u1_2 (cost=0.00..45175.75 rows=111 width=4)
(actual time=0.040..3389.550 rows=712952 loops=1)
Filter: (upper(response) ~~ '%SUCCESS%'::text)

Rows Removed by Filter: 393898

The main thing that sticks out to me for this plan is the low estimate for
the rows it will return on the %SUCCESS% filter.

Here is the second query with explain:

SELECT *
FROM "lead"
WHERE
(
NOT
(
("lead"."id" IN
(
SELECT U1."lead_id" AS "lead_id"
FROM "event" U1
WHERE U1."event_type" ='type_1'
)
OR
("lead"."id" IN
(
SELECT U1."lead_id" AS "lead_id"
FROM "event" U1
WHERE U1."event_type" = 'type_2
)
AND "lead"."id" IN
(
SELECT U1."lead_id" AS "lead_id"
FROM "event" U1
WHERE successful
)
)
)
)
AND NOT ("lead"."ReferenceNumber" = '')
) ;

explain result:
Seq Scan on lead (cost=85775.78..9005687281.12 rows=139957 width=369)

Filter: ((NOT (hashed SubPlan 1)) AND (("ReferenceNumber")::text <>
''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (SubPlan 3))))
SubPlan 1

-> Seq Scan on event u1 (cost=0.00..42408.62 rows=7748 width=4)

Filter: ((event_type)::text = 'type_1'::text)

SubPlan 2

-> Seq Scan on event u1_1 (cost=0.00..42408.62 rows=375665 width=4)

Filter: ((event_type)::text = 'type_2'::text)

SubPlan 3

-> Materialize (cost=0.00..46154.43 rows=731185 width=4)

-> Seq Scan on event u1_2 (cost=0.00..39641.50 rows=731185
width=4)
Filter: successful

Here is does a materialize and estimates rows properly, but as stated this
query just hangs and pegs load. There are no locks and its in an active
state the whole time. I am running these queries in a test environment on
a recently exported full schema from production, with a reindex and a
vacuum/analyze. This is postgres 9.3.6 on rhel6.

When I run just the different subquery element:
SELECT U1."lead_id" AS "lead_id"
FROM "event" U1

WHERE successful;

it returns in about 250ms, with the text field %SUCCESS% it runs in about 4
seconds. This seemed like a low hanging fruit query improvement so I'm
surprised its not working, it seems like we are just lucky that the planner
is estimating that filter incorrectly in the original form.

I'm sure the query just needs to be completely overhauled and am starting
to pull it apart and work with the engineers to get something more
efficient set up overall, but I am not sure how to answer the question as
to why this original attempt at improving the query is not successful.

Any guidance is greatly appreciated, thanks!

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Graeme B. Bell 2015-07-07 15:53:43 Re: New server: SSD/RAID recommendations?
Previous Message Heikki Linnakangas 2015-07-07 14:59:49 Re: New server: SSD/RAID recommendations?