Re: Boyer-More-Horspool searching LIKE queries

From: Atsushi Ogawa <atsushi(dot)ogawa(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Boyer-More-Horspool searching LIKE queries
Date: 2022-01-14 14:40:44
Message-ID: CAO2GH9EgykEvn_JQ+dPXJE1Aq4bo7KUQ=AGQVhLpxKhDqgkNuQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks for the comments.

> > The conditions under which B-M-H can be used are as follows.
> >
> > (1) B-M-H in LIKE search supports only single-byte character sets and
UTF8.
> > Multibyte character sets does not support, because it may contain
another
> > characters in the byte sequence. For UTF-8, it works fine, because in
> > UTF-8 the byte sequence of one character cannot contain another
character.
>
> You can make it work with any encoding, if you check for that case after
> you find a match. See how text_position() does it.

I saw the text_position(). I would like to do the same.

> > (3) The pattern string should be at least 4 characters.
> > For example, '%AB%' can use B-M-H.
>
> To be precise, the patch checks that the pattern string is at least 4
> *bytes* long. A pattern like E'%\U0001F418%' would benefit too.

*bytes* is precise. I will revise the comment of code.

> If I'm reading the code correctly, it doesn't account for escapes
> correctly. It will use B-M-H for a pattern like '%\\%', even though
> that's just searching for a single backslash and won't benefit from B-M-H.

You are correct. I will fix it.

> > (4) The first and last character of the pattern string should be '%'.
>
> I wonder if we can do better than that. If you have a pattern like
> '%foo%bar', its pretty obvious (to a human) that you can quickly check
> if the string ends in 'bar', and then check if it also contains the
> substring 'foo'. Is there some way to generalize that?

I think the following optimizations are possible.

(1)%foo%bar
Check if the string ends with 'bar' and search for 'foo' by B-M-H.

(2)foo%bar%
Check if the string starts with 'foo' and search for 'bar' by B-M-H.

(3)foo%bar%baz
Check if the string starts with 'foo' and string ends with 'baz' and
search for 'bar' by B-M-H.

> Looking at MatchText() in like.c, there is this piece of code:
>
> > else if (*p == '%')
> > {
> > char firstpat;
> >
> > /*
> > * % processing is essentially a search for a
text position at
> > * which the remainder of the text matches the
remainder of the
> > * pattern, using a recursive call to check each
potential match.
> > *
> > * If there are wildcards immediately following
the %, we can skip
> > * over them first, using the idea that any
sequence of N _'s and
> > * one or more %'s is equivalent to N _'s and one
% (ie, it will
> > * match any sequence of at least N text
characters). In this way
> > * we will always run the recursive search loop
using a pattern
> > * fragment that begins with a literal
character-to-match, thereby
> > * not recursing more than we have to.
> > */
> > NextByte(p, plen);
> >
> > while (plen > 0)
> > {
> > if (*p == '%')
> > NextByte(p, plen);
> > else if (*p == '_')
> > {
> > /* If not enough text left to
match the pattern, ABORT */
> > if (tlen <= 0)
> > return LIKE_ABORT;
> > NextChar(t, tlen);
> > NextByte(p, plen);
> > }
> > else
> > break; /* Reached a
non-wildcard pattern char */
> > }
>
> Could we use B-M-H to replace that piece of code?

For example, in a pattern such as %foo%bar%, it is possible to first search
for 'foo' by B-M-H, and then search for 'bar' by B-M-H. It would be nice if
such a
process could be generalized to handle various LIKE search patterns.

> How does the performance compare with regular expressions? Would it be
> possible to use this for trivial regular expressions too? Or could we
> speed up the regexp engine to take advantage of B-M-H, and use it for
> LIKE? Or something like that?

I think regular expressions in postgresql is slower than LIKE.
I compared it with the following two SQLs.

(1)LIKE: execution time is about 0.8msec
select count(*) from pg_catalog.pg_description d where d.description like
'%greater%';

(2)regular expression: execution time is about 3.1 msec
select count(*) from pg_catalog.pg_description d where d.description ~
'greater';

For trivial regular expressions, it may be better to use LIKE.

> > I have measured the performance with the following query.
>
> Setting up the B-M-H table adds some initialization overhead, so this
> would be a loss for cases where the LIKE is executed only once, and/or
> the haystack strings are very small. That's probably OK, the overhead is
> probably small, and those cases are probably not performance-critical.
> But would be nice to measure that too.

I tried to measure the case where LIKE is executed only once and
the haystack string are very small.

---------------------------------------------------------------
SET client_min_messages TO notice;

\timing

DO $$
DECLARE
cnt integer := 0;
total integer := 0;
BEGIN
FOR i IN 1..10000 LOOP
select count(*) into cnt from pg_class where oid = 2662 and relname
like '%cl%';
total := total + cnt;
END LOOP;

RAISE NOTICE 'TOTAL: %', total;
END
$$
;
---------------------------------------------------------------

without patch: 74.499msec
with patch 77.321msec

In this case, the patched version will be a few percent slower, but I think
the overhead is small.

Regards,
Atsushi Ogawa

2022年1月12日(水) 5:17 Heikki Linnakangas <hlinnaka(at)iki(dot)fi>:

> On 11/01/2022 15:55, Atsushi Ogawa wrote:
> > I have created a patch to enable the Boyer-More-Horspool search
> > algorithm (B-M-H) for LIKE queries.
>
> Cool!
>
> > The conditions under which B-M-H can be used are as follows.
> >
> > (1) B-M-H in LIKE search supports only single-byte character sets and
> UTF8.
> > Multibyte character sets does not support, because it may contain another
> > characters in the byte sequence. For UTF-8, it works fine, because in
> > UTF-8 the byte sequence of one character cannot contain another
> character.
>
> You can make it work with any encoding, if you check for that case after
> you find a match. See how text_position() does it.
>
> > (3) The pattern string should be at least 4 characters.
> > For example, '%AB%' can use B-M-H.
>
> To be precise, the patch checks that the pattern string is at least 4
> *bytes* long. A pattern like E'%\U0001F418%' would benefit too.
>
> If I'm reading the code correctly, it doesn't account for escapes
> correctly. It will use B-M-H for a pattern like '%\\%', even though
> that's just searching for a single backslash and won't benefit from B-M-H.
>
> > (4) The first and last character of the pattern string should be '%'.
>
> I wonder if we can do better than that. If you have a pattern like
> '%foo%bar', its pretty obvious (to a human) that you can quickly check
> if the string ends in 'bar', and then check if it also contains the
> substring 'foo'. Is there some way to generalize that?
>
> Looking at MatchText() in like.c, there is this piece of code:
>
> > else if (*p == '%')
> > {
> > char firstpat;
> >
> > /*
> > * % processing is essentially a search for a text
> position at
> > * which the remainder of the text matches the
> remainder of the
> > * pattern, using a recursive call to check each
> potential match.
> > *
> > * If there are wildcards immediately following
> the %, we can skip
> > * over them first, using the idea that any
> sequence of N _'s and
> > * one or more %'s is equivalent to N _'s and one
> % (ie, it will
> > * match any sequence of at least N text
> characters). In this way
> > * we will always run the recursive search loop
> using a pattern
> > * fragment that begins with a literal
> character-to-match, thereby
> > * not recursing more than we have to.
> > */
> > NextByte(p, plen);
> >
> > while (plen > 0)
> > {
> > if (*p == '%')
> > NextByte(p, plen);
> > else if (*p == '_')
> > {
> > /* If not enough text left to
> match the pattern, ABORT */
> > if (tlen <= 0)
> > return LIKE_ABORT;
> > NextChar(t, tlen);
> > NextByte(p, plen);
> > }
> > else
> > break; /* Reached a
> non-wildcard pattern char */
> > }
>
> Could we use B-M-H to replace that piece of code?
>
> How does the performance compare with regular expressions? Would it be
> possible to use this for trivial regular expressions too? Or could we
> speed up the regexp engine to take advantage of B-M-H, and use it for
> LIKE? Or something like that?
>
> > I have measured the performance with the following query.
>
> Setting up the B-M-H table adds some initialization overhead, so this
> would be a loss for cases where the LIKE is executed only once, and/or
> the haystack strings are very small. That's probably OK, the overhead is
> probably small, and those cases are probably not performance-critical.
> But would be nice to measure that too.
>
> - Heikki
>

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Justin Pryzby 2022-01-14 14:44:03 Re: support for MERGE
Previous Message Amit Kapila 2022-01-14 14:35:49 Re: Consistently use the function name CreateCheckPoint instead of CreateCheckpoint in code comments