RFC: Boyer-Moore-Horspool optimization for LIKE '%pattern%' searches

From: Andriy Dorokhin <andreumdorokhinum(at)gmail(dot)com>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: RFC: Boyer-Moore-Horspool optimization for LIKE '%pattern%' searches
Date: 2026-05-26 09:54:21
Message-ID: 88272f23-19b4-493d-bdd7-258218b74881@gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi everyone,

I'm a newbie and would like to explore whether Boyer-Moore-Horspool
substring searching could improve performance for LIKE queries.

This topic has come up a few times previously on pgsql-hackers.
As I understand it, the main concerns raised previously were:
    * BMH is not universally faster, especially for short patterns
    * preprocessing costs may outweigh benefits
    * LIKE semantics become more complicated when '%' and '_' appear
internally
    * rewriting LIKE into strpos() is not always appropriate
    * SIMD/vectorized approaches may outperform BMH on modern CPUs

Still, I was curious whether the practical gains for longer literal
substring searches might justify revisiting the idea.

Since PostgreSQL already uses a Boyer-Moore-like approach in strpos(), I
did some benchmarks:
https://medium.com/@andreumdorokhinum/does-postgres-need-the-boyer-moore-horspool-search-algorithm-for-like-operator-00b43e4b115c

Of course, these are only preliminary measurements.

What should I focus next?
* effects across different collations/locales?
* short-pattern regressions?
* UTF-8 / multibyte behavior?

At the moment I'm mainly interested in understanding whether the
community believes this is still worth exploring, and if not, what the
primary blockers are likely to be.

Thanks,
Andriy

Browse pgsql-hackers by date

  From Date Subject
Next Message Jim Jones 2026-05-26 10:02:45 Re: Avoid leaking system path from pg_available_extensions
Previous Message Jim Jones 2026-05-26 09:46:44 Re: [PATCH] Add CANONICAL option to xmlserialize