
Re: Problem (bug?) with like

From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: drheart(at)wanadoo(dot)es, Lista PostgreSql <pgsql-general(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Problem (bug?) with like
Date: 2001-12-29 05:08:12
Message-ID: 200112290508.fBT58Cn14715@candle.pha.pa.us
Lists: pgsql-general pgsql-hackers
> The problem here is that the planner is being way too optimistic about
> the selectivity of LIKE '%DAVID%' --- notice the estimate that only
> one matching row will be found in cliente, rather than 54 as with '%DA%'.
> So it chooses a plan that avoids the sort overhead needed for an
> efficient merge join with the other tables.  That would be a win if
> there were only one matching row, but as soon as there are lots, it's
> a big loss, because the subquery to join the other tables gets redone
> for every matching row :-(
> 
> >> Also, how many rows are there really that match '%DA%' and '%DAVID%'?
> 
> >  1)   2672 rows    -> 3.59 sec.
> >  2)   257 rows     -> 364.69 sec.
> 
> I am thinking that the rules for selectivity of LIKE patterns probably
> need to be modified.  Presently the code assumes that a long constant
> string has probability of occurrence proportional to the product of the
> probabilities of the individual letters.  That might be true in a random
> world, but people don't search for random strings.  I think we need to
> back off the selectivity estimate by some large factor to account for
> the fact that the pattern being searched for is probably not random.
> Anyone have ideas how to do that?
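
For context, the per-character model Tom describes can be sketched in a
few lines of C.  This is only an illustration, not the actual selfuncs.c
code; like_selectivity_sketch() is a stand-in name:

	#include <stdio.h>

	#define FIXED_CHAR_SEL 0.20	/* assumed per-character selectivity */

	/*
	 * Naive LIKE selectivity: each fixed character multiplies the
	 * estimate by FIXED_CHAR_SEL; wildcards are skipped here.
	 */
	static double
	like_selectivity_sketch(const char *pattern)
	{
		double	sel = 1.0;

		for (; *pattern; pattern++)
		{
			if (*pattern == '%' || *pattern == '_')
				continue;	/* wildcard: no contribution */
			sel *= FIXED_CHAR_SEL;
		}
		return sel;
	}

	int
	main(void)
	{
		printf("%%DA%%    -> %g\n", like_selectivity_sketch("%DA%"));
		printf("%%DAVID%% -> %g\n", like_selectivity_sketch("%DAVID%"));
		return 0;
	}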

Let's apply the proposed FIXED_CHAR_SEL values to the example above.
With the new 0.20 value, the estimates for DA and DAVID are:

	DA    (case 1):  0.20 ^ 2 = 0.04
	DAVID (case 2):  0.20 ^ 5 = 0.00032

Dividing these two estimates, the model predicts DA should match 125
times as many rows as DAVID:

	> 0.04 / 0.00032
	        125

while the actual counts reported above give a ratio of only about 10.4:

	> 2672 / 257
	        ~10.39688715953307392996

By comparison, a FIXED_CHAR_SEL of 0.04 predicts a ratio of 15625:

	> 0.04 ^ 2
	        .0016
	> 0.04 ^ 5
	        .0000001024
	> .0016 / .0000001024
	        15625
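
Checking the arithmetic programmatically (a throwaway C snippet using
only the numbers quoted above):

	#include <stdio.h>
	#include <math.h>

	int
	main(void)
	{
		/* predicted DA : DAVID row-count ratios at each constant */
		printf("0.20: %g\n", pow(0.20, 2) / pow(0.20, 5));	/* 125 */
		printf("0.04: %g\n", pow(0.04, 2) / pow(0.04, 5));	/* 15625 */

		/* observed ratio from the reported row counts */
		printf("real: %g\n", 2672.0 / 257.0);			/* ~10.4 */
		return 0;
	}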

Against the real ratio of ~10.4, the 0.20 prediction (125) is roughly
12x too large, while the 0.04 prediction (15625) is roughly 1500x too
large.  Because this was a contrived example, and because many fields
hold text more random than a name like DAVID, I think 0.20 is the
proper value.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman(at)candle(dot)pha(dot)pa(dot)us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
