Re: NOT LIKE much faster than LIKE?

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Matteo Beccati <php(at)beccati(dot)com>, Andrea Arcangeli <andrea(at)cpushare(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: NOT LIKE much faster than LIKE?
Date: 2006-01-12 00:48:36
Message-ID: 1137026916.21025.585.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Tue, 2006-01-10 at 17:21 -0500, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > I think its OK to use the MCV, but I have a problem with the current
> > heuristics: they only work for randomly generated strings, since the
> > selectivity goes down geometrically with length.
>
> We could certainly use a less aggressive curve for that. You got a
> specific proposal?

non-left anchored LIKE is most likely going to be used with
unstructured, variable length data - else we might use SUBSTRING
instead. My proposal would be to assume that LIKE is acting on human
language text data.

I considered this a while back, but wrote it off in favour of dynamic
sampling - but it's worth discussing this to see whether we can improve
on things without that.

Here's one of the links I reviewed previously:
http://www.ling.lu.se/persons/Joost/Texts/studling.pdf
Sigurd et al [2004]

This shows word frequency distribution peaks at 3 letter/2 phoneme
words, then tails off exponentially after that.

Clearly when search string > 3 then the selectivity must tail off
exponentially also, since we couldn't find words shorter than the search
string itself. The search string might be a phrase, but it seems
reasonable to assume that phrases also drop off in frequency according
to length. It is difficult to decide what to do at len=2 or len=3, and I
would be open to various thoughts, but would default to keeping
like_selectivity as it is now.

Sigurd et al show that word length tails off at 0.7^Len beyond Len=3, so
selectivity FIXED_CHAR_SEL should not be more than 0.7, but I see no
evidence for it being as low as 0.2 (from the published results). For
simplicity, where Len > 3, I would make the tail off occur with factor
0.5, rather than 0.2.

We could see a few more changes from those results, but curbing the
aggressive tail off would be a simple and easy act.

Best Regards, Simon Riggs

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Michael Glaesemann 2006-01-12 01:26:58 Re: indexes on primary and foreign keys
Previous Message Mark Liberman 2006-01-12 00:41:20 Stable function being evaluated more than once in a single query