Re: [HACKERS] Problem (bug?) with like

From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, drheart(at)wanadoo(dot)es, Lista PostgreSql <pgsql-general(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] Problem (bug?) with like
Date: 2001-12-28 18:27:26
Message-ID: 200112281827.fBSIRRk28377@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

> > Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > > But what about '%A%' vs. '%AC%'. Seems the second is reasonably
> > > different from the first the our optimizer may be fine with that. Is it
> > > only when the strings get longer that we lose specificity?
> >
> > Yeah, I don't think that the estimates are bad for one or two
> > characters. But the estimate gets real small real fast as you
> > increase the number of match characters in the LIKE pattern.
> > We need to slow that down some.
>
> Yea, maybe a log base 2 decrease:
>
> 1 char 1x
> 2 char 2x
> 4 char 3x
> 8 char 4x
> 16 char 5x

I did a little research on this. I think the problem is that ordinary
characters are assumed to randomly appear in a character string, while
in practice, if the string has already been specified like 'DAV', there
are very few additional characters that can follow it and make sense.

Looking at backend/utils/adt/selfuncs.c, I see this:

#define FIXED_CHAR_SEL 0.04 /* about 1/25 */
...
sel *= FIXED_CHAR_SEL;

which means every additional character reduces the selectivity by 96%.
This seems much too restrictive to me. Because of the new optimizer
buckets, we do have good statistics on the leading character, but
additional characters drastically reduce selectivity. I think perhaps a
number like 0.50 or 50% may be correct.

That would be a table like this:

1 char 2x
2 char 4x
4 char 8x
8 char 16x
16 char 32x

which is more restrictive than I initially suggested above but less
restrictive than we have now.

Should we assume additional characters are indeed randomly appearing in
the string?

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Bruce Momjian 2001-12-28 18:28:55 Re: First decent PostgreSQL CBT now on techdocs.postgresql.org
Previous Message Tom Lane 2001-12-28 18:21:19 Re: First decent PostgreSQL CBT now on techdocs.postgresql.org

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2001-12-28 18:35:41 Re: Problem (bug?) with like
Previous Message Bruce Momjian 2001-12-28 17:57:53 Re: TODO question