| From: | David Rowley <dgrowleyml(at)gmail(dot)com> |
|---|---|
| To: | Andriy Dorokhin <andreumdorokhinum(at)gmail(dot)com> |
| Cc: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
| Subject: | Re: RFC: Boyer-Moore-Horspool optimization for LIKE '%pattern%' searches |
| Date: | 2026-06-10 05:50:14 |
| Message-ID: | CAApHDvpubPLE6N1vfVpXzqZ2Z4=S8P_f0bxtygUi28RGwqHz2w@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Tue, 26 May 2026 at 21:54, Andriy Dorokhin
<andreumdorokhinum(at)gmail(dot)com> wrote:
> I'm a newbie and would like to explore whether Boyer-Moore-Horspool
> substring searching could improve performance for LIKE queries.
>
> This topic has come up a few times previously on pgsql-hackers.
> As I understand it, the main concerns raised previously were:
> * BMH is not universally faster, especially for short patterns
I don't recall the numbers now, but there were certainly regressions
from using B-M-H for shorter search patterns. You can see how the
overhead of that has been minimised by looking at
text_position_setup(). You can see that the skip table array will
increasingly share elements between multiple characters that match
after applying the skiptablemask as the search length becomes smaller.
That was done to save having to initialise the entire skip table
array, which was too costly for smaller searches. Unfortunately, I
didn't post benchmark results onto the mailing list for this.
> * preprocessing costs may outweigh benefits
> * LIKE semantics become more complicated when '%' and '_' appear
> internally
> * rewriting LIKE into strpos() is not always appropriate
> * SIMD/vectorized approaches may outperform BMH on modern CPUs
It certainly would be good to revisit this. I didn't ever look at SIMD
for this purpose. It was 18 years ago that I worked on the B-M-H
patch. It's likely overdue for someone to check what we could make
better using more modern methods.
> Since PostgreSQL already uses a Boyer-Moore-like approach in strpos(), I
> did some benchmarks:
> https://medium.com/@andreumdorokhinum/does-postgres-need-the-boyer-moore-horspool-search-algorithm-for-like-operator-00b43e4b115c
>
> Of course, these are only preliminary measurements.
>
> What should I focus next?
I think it would be good to go over which sort of LIKE patterns you
could make this work for and invent a design that works for those
patterns. For example, we know that we could use it for patterns such
as '%word%', but more complex patterns are also possible. For
example, '%word1%word2%' could be done. There would need to be an
additional check that 'word2' comes after 'word1', or only begin the
search after the end of 'word1'. You likely would also want to support
leading and trailing searches as it might surprise users if
'%word1%word2%' was fast, but if you removed the leading or trailing
%, it suddenly fell back to the slow search. So I suspect you'd want
to have 3 search types; leading, contains, and trailing, and then
generate a search key by preprocessing the search pattern, turning
that into something like an array of structs on what the search should
do. You'd need to be able to feed search results from previous structs
into following structs so that the search for 'word2' knows that it
must only start the search at a certain point. I think if you have
the ability to start searches at a certain point, then you can maybe
do '_' searches too.
> * short-pattern regressions?
Certainly, you will need to check for this. You'd obviously need to
write the code first and then you might have to take measures to
mitigate regressions on smaller strings.
> * UTF-8 / multibyte behavior?
I don't know about that part. I only ever did this for single-byte
chars. I believe Heikki did the multi-byte part. I don't think it's
unreasonable to start focusing on single-byte first.
> At the moment I'm mainly interested in understanding whether the
> community believes this is still worth exploring, and if not, what the
> primary blockers are likely to be.
I believe it's more useful to speed up LIKE than it ever was to speed
up POSITION(). You're not going to get any pushback from anyone who
thinks "LIKE is fast enough already". The key here is the balance
between introduced code complexity and maintainability vs observed
performance benefits. For example, if you add 1000 lines of code and
we get a 1% speedup, then you'll probably fail to get this accepted.
You should aim to make that ratio as favourable as possible. I.e,
simple as possible and well-documented code that gives a meaningful
performance improvement.
If I were doing this today, I'd probably start by figuring out which
patterns we can support, then write a benchmarking script which aims
to test a decent amount of text and patterns of various lengths. I
expect testing text lengths starting with 1 character, then double
that in powers of 2 up to maybe 2048. Maybe you can have a small,
medium and long search key within each of those and then code up
several tests within each. For example, "[no] match, leading", "[no]
match within", "[no] match trailing", and then aggregate the results
by { search text len, key_len }. You should be able to graph those
results showing the performance increase as a percentage (or
regression). You'll find having that script very useful as you tweak
the patch to iron out regressions for smaller searches.
David
| From | Date | Subject | |
|---|---|---|---|
| Previous Message | Michael Paquier | 2026-06-10 05:40:41 | Re: Reject negative max_retention_duration values |