| From: | Andriy Dorokhin <andreumdorokhinum(at)gmail(dot)com> |
|---|---|
| To: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
| Subject: | RFC: Boyer-Moore-Horspool optimization for LIKE '%pattern%' searches |
| Date: | 2026-05-26 09:54:21 |
| Message-ID: | 88272f23-19b4-493d-bdd7-258218b74881@gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi everyone,
I'm a newbie and would like to explore whether Boyer-Moore-Horspool
substring searching could improve performance for LIKE queries.
This topic has come up a few times previously on pgsql-hackers.
As I understand it, the main concerns raised previously were:
* BMH is not universally faster, especially for short patterns
* preprocessing costs may outweigh benefits
* LIKE semantics become more complicated when '%' and '_' appear
internally
* rewriting LIKE into strpos() is not always appropriate
* SIMD/vectorized approaches may outperform BMH on modern CPUs
Still, I was curious whether the practical gains for longer literal
substring searches might justify revisiting the idea.
Since PostgreSQL already uses a Boyer-Moore-like approach in strpos(), I
did some benchmarks:
https://medium.com/@andreumdorokhinum/does-postgres-need-the-boyer-moore-horspool-search-algorithm-for-like-operator-00b43e4b115c
Of course, these are only preliminary measurements.
What should I focus next?
* effects across different collations/locales?
* short-pattern regressions?
* UTF-8 / multibyte behavior?
At the moment I'm mainly interested in understanding whether the
community believes this is still worth exploring, and if not, what the
primary blockers are likely to be.
Thanks,
Andriy
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Jim Jones | 2026-05-26 10:02:45 | Re: Avoid leaking system path from pg_available_extensions |
| Previous Message | Jim Jones | 2026-05-26 09:46:44 | Re: [PATCH] Add CANONICAL option to xmlserialize |