"like" vs "substring" again

From: Christian Schröder <cs(at)deriva(dot)de>
To: pgsql-general(at)postgresql(dot)org
Subject: "like" vs "substring" again
Date: 2007-09-15 14:26:10
Message-ID: 46EBEB82.6050302@deriva.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi list,

last week I asked a question about a query with several joins and a
"like" operator which was really slow. When I replaced "like" with
"substring" (which was possible because the comparison was simply "bla
like '123%'") the query became extremely faster because the query
optimizer came to another plan.

Gregory Stark gave me the hint (thanks, Gregory!) that the query
optimizer made wrong assumptions about the selectivity of the "like".
When I used "substring" the assumptions became better and so it chose a
better (faster) plan. I then increased the statistics target of the
column and the query with "like" became as fast as when I used
"substring". So far, so good.

Now I have a similar problem: I have a query (which doesn't look too
complicated to me) which takes about 4.5 hours on a 2 GHz Dual-Core Xeon
machine. The query joins several tables and has two comparisons, one
with "not like", the other with "substring(...) = ". When I use "like"
and "not like" together or "substring(...) = " and "substring(...) <> "
together, the query takes about 5 seconds. The plan is identical in both
cases and different to the plan when I mix the comparisons. The most
obvious difference is the number of rows the query optimizer expects to
get from the table which is filtered: It expects 1 if I mix the
comparison operators and 84 if I consistently use "like" or "substring".
The real number of selected rows is 1667 (from a total of 2884 rows), so
both estimations are rather wrong. Note that this is exactly the same
column for which I increased the statistics target to 500 after last
week's discussion ...

I then set up a test table with the problematic column and filled it
with the same data. The test table looks as follows:

Table "pg_temp_3.temp"
Column | Type | Modifiers
--------+---------------+-----------
test | character(10) | not null

I set the statistics target to 1000 for this column and ran the
following queries:

explain analyze select * from temp where test like '11%' and test not like '113%';
QUERY PLAN
------------------------------------------------------------------------------------------------------
Seq Scan on "temp" (cost=0.00..62.26 rows=39 width=14) (actual time=0.012..1.229 rows=1678 loops=1)
Filter: ((test ~~ '11%'::text) AND (test !~~ '113%'::text))
Total runtime: 1.655 ms
(3 rows)

explain analyze select * from temp where substring(test from 1 for 2) = '11' and substring(test from 1 for 3) <> '113';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
Seq Scan on "temp" (cost=0.00..91.10 rows=14 width=14) (actual time=0.020..3.282 rows=1678 loops=1)
Filter: (("substring"((test)::text, 1, 2) = '11'::text) AND ("substring"((test)::text, 1, 3) <> '113'::text))
Total runtime: 3.719 ms
(3 rows)

explain analyze select * from temp where substring(test from 1 for 2) = '11' and test not like '113%';
QUERY PLAN
-----------------------------------------------------------------------------------------------------
Seq Scan on "temp" (cost=0.00..76.68 rows=1 width=14) (actual time=0.018..2.469 rows=1678 loops=1)
Filter: (("substring"((test)::text, 1, 2) = '11'::text) AND (test !~~ '113%'::text))
Total runtime: 2.914 ms
(3 rows)

As far as I understand, all queries are semantically identical and have
the same result set. However, the query optimizer makes very different
estimations about the number of rows the queries would return. All the
estimations are far from reality, and at least the last one leads to
fatal results when this "where" clause is part of a more complex query.

So I have the following questions:

1. Why does the query optimizer not recognize that the expressions
are equivalent?
2. What can I do to improve the estimation of the query optimizer? I
tried to create an index (with opclass "bpchar_pattern_ops") which
was actually used in the first query, but did not improve the
estimation or the execution speed.

Thanks again for any help!

Regards,
Christian

--
Deriva GmbH Tel.: +49 551 489500-42
Financial IT and Consulting Fax: +49 551 489500-91
Hans-Böckler-Straße 2 http://www.deriva.de
D-37079 Göttingen

Deriva CA Certificate: http://www.deriva.de/deriva-ca.cer

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Tom Lane 2007-09-15 14:47:12 Re: strange TIME behaviour
Previous Message rihad 2007-09-15 13:53:26 Re: strange TIME behaviour