Re: Problem (bug?) with like

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: drheart(at)wanadoo(dot)es
Cc: Lista PostgreSql <pgsql-general(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Problem (bug?) with like
Date: 2001-12-04 17:16:27
Message-ID: 20011204171627.GB450@lorien.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

bombadil(at)wanadoo(dot)es writes:
> 1) # explain SELECT * from v_cliente where nombre like '%DA%';

> Merge Join (cost=54763.50..62874.36 rows=413980 width=183)
> -> Sort (cost=16238.44..16238.44 rows=54 width=131)
> -> Seq Scan on cliente c (cost=0.00..16236.88 rows=54 width=131)
> -> Sort (cost=38525.06..38525.06 rows=20097 width=74)
> -> Subquery Scan d (cost=891.91..37088.66 rows=20097 width=74)
> -> Hash Join (cost=891.91..37088.66 rows=20097 width=74)
> ...

> 2) explain SELECT * from v_cliente where nombre like '%DAVID%';

> Nested Loop (cost=891.91..61414.58 rows=7638 width=183)
> -> Seq Scan on cliente c (cost=0.00..16236.88 rows=1 width=131)
> -> Subquery Scan d (cost=891.91..37088.66 rows=20097 width=74)
> -> Hash Join (cost=891.91..37088.66 rows=20097 width=74)
> ... [same subplan as above]

The problem here is that the planner is being way too optimistic about
the selectivity of LIKE '%DAVID%' --- notice the estimate that only
one matching row will be found in cliente, rather than 54 as with '%DA%'.
So it chooses a plan that avoids the sort overhead needed for an
efficient merge join with the other tables. That would be a win if
there were only one matching row, but as soon as there are lots, it's
a big loss, because the subquery to join the other tables gets redone
for every matching row :-(

>> Also, how many rows are there really that match '%DA%' and '%DAVID%'?

> 1) 2672 rows -> 3.59 sec.
> 2) 257 rows -> 364.69 sec.

I am thinking that the rules for selectivity of LIKE patterns probably
need to be modified. Presently the code assumes that a long constant
string has probability of occurrence proportional to the product of the
probabilities of the individual letters. That might be true in a random
world, but people don't search for random strings. I think we need to
back off the selectivity estimate by some large factor to account for
the fact that the pattern being searched for is probably not random.
Anyone have ideas how to do that?

regards, tom lane

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Hannu Krosing 2001-12-04 17:30:33 Re: [HACKERS] Problem (bug?) with like
Previous Message Bruce Momjian 2001-12-04 16:58:25 Re: plpgsql commenting broken?

Browse pgsql-hackers by date

  From Date Subject
Next Message Hannu Krosing 2001-12-04 17:30:33 Re: [HACKERS] Problem (bug?) with like
Previous Message Tom Lane 2001-12-04 16:59:57 Re: FW: [CYGWIN] 7.2b3 postmaster doesn't start on Win98