Re: Speeding up text_position_next with multibyte encodings

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: John Naylor <jcnaylor(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speeding up text_position_next with multibyte encodings
Date: 2018-12-23 00:26:11
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 14/12/2018 20:20, John Naylor wrote:
> I signed up to be a reviewer, and I will be busy next month, so I went
> ahead and fixed the typo in the patch that broke assert-enabled
> builds. While at it, I standardized on the spelling "start_ptr" in a
> few places to match the rest of the file. It's a bit concerning that
> it wouldn't compile with asserts, but the patch was written by a
> committer and seems to work.

Thanks for fixing that!

> On 10/19/18, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>> This is a win in most cases. One case is slower: calling position() with
>> a large haystack input, where the match is near the end of the string.
> I wanted to test this case in detail, so I ran the attached script, ...

I'm afraid that script doesn't work as a performance test. The
position() function is immutable, so the result gets cached in the plan
cache. All you're measuring is the speed to get the constant from the
plan cache :-(.

I rewrote the same tests with a little C helper function, attached, to
fix that, and to eliminate any PL/pgSQL overhead. And I adjusted the
loop counts so that the runtimes are reasonable, now that it actually
does some work.


length master patch
short 2.7s 10.9s
med 2.6s 11.4s
long 3.9s 9.9s


length master patch
short 2.2s 7.5s
med 2.2s 3.5s
long 3.4s 3.6s

This doesn't look very flattering for the patch. Although, this is
deliberately testing the worst case.

You chose interesting characters for the UTF-8 test. The haystack is a
repeating pattern of byte sequence EC 99 84, and the needle is a
repeating pattern of EC 84 B1. In the 'long' test, the lengths in the
skip table are '2', '1' and '250'. But the search bounces between the
'2' and '1' cases, and never hits the byte that would allow it to skip
250 bytes. Interesting case, I had not realized that that can happen.
But I don't think we need to put much weight on that, you could come up
with other scenarios where the current code has skip table collisions, too.

So overall, I think it's still a worthwhile tradeoff, given that that is
a worst case scenario. If you choose less pathological UTF-8 codepoints,
or there is no match or the match is closer to the beginning of the
string, the patch wins.

- Heikki

Attachment Content-Type Size
single-byte-bmh-benchmark.patch text/x-patch 904 bytes
single-byte-bmh-benchmark.sql application/sql 624 bytes

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2018-12-23 00:28:58 Re: Speeding up text_position_next with multibyte encodings
Previous Message Simon Riggs 2018-12-22 19:15:02 Re: Joins on TID