Re: Speeding up text_position_next with multibyte encodings

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, John Naylor <jcnaylor(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speeding up text_position_next with multibyte encodings
Date: 2018-12-23 14:33:23
Message-ID: 30dc0ed3-cc30-abda-427d-ac6e8305525b@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 12/23/18 1:26 AM, Heikki Linnakangas wrote:
> On 14/12/2018 20:20, John Naylor wrote:
>> I signed up to be a reviewer, and I will be busy next month, so I went
>> ahead and fixed the typo in the patch that broke assert-enabled
>> builds. While at it, I standardized on the spelling "start_ptr" in a
>> few places to match the rest of the file. It's a bit concerning that
>> it wouldn't compile with asserts, but the patch was written by a
>> committer and seems to work.
>
> Thanks for fixing that!
>
>> On 10/19/18, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>>> This is a win in most cases. One case is slower: calling position() with
>>> a large haystack input, where the match is near the end of the string.
>>
>> I wanted to test this case in detail, so I ran the attached script, ...
>
> I'm afraid that script doesn't work as a performance test. The
> position() function is immutable, so the result gets cached in the plan
> cache. All you're measuring is the speed to get the constant from the
> plan cache :-(.
>
> I rewrote the same tests with a little C helper function, attached, to
> fix that, and to eliminate any PL/pgSQL overhead. And I adjusted the
> loop counts so that the runtimes are reasonable, now that it actually
> does some work.
>
> UTF-8
>
> length        master        patch
> short        2.7s        10.9s
> med        2.6s        11.4s
> long        3.9s        9.9s
>
> EUC_KR
>
> length        master        patch
> short        2.2s        7.5s
> med        2.2s        3.5s
> long        3.4s        3.6s
>
> This doesn't look very flattering for the patch. Although, this is
> deliberately testing the worst case.
>
> You chose interesting characters for the UTF-8 test. The haystack is a
> repeating pattern of byte sequence EC 99 84, and the needle is a
> repeating pattern of EC 84 B1. In the 'long' test, the lengths in the
> skip table are '2', '1' and '250'. But the search bounces between the
> '2' and '1' cases, and never hits the byte that would allow it to skip
> 250 bytes. Interesting case, I had not realized that that can happen.
> But I don't think we need to put much weight on that, you could come up
> with other scenarios where the current code has skip table collisions, too.
>
> So overall, I think it's still a worthwhile tradeoff, given that that is
> a worst case scenario. If you choose less pathological UTF-8 codepoints,
> or there is no match or the match is closer to the beginning of the
> string, the patch wins.
>

So, what is the expected speedup in a "good/average" case? Do we have
some reasonable real-world workload mixing these cases that could be
used as a realistic benchmark?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2018-12-23 14:59:22 Re: CF app feature request
Previous Message Magnus Hagander 2018-12-23 12:09:31 Re: CF app feature request