|From:||Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>|
|Subject:||Re: Speeding up text_position_next with multibyte encodings|
|Views:||Raw Message | Whole Thread | Download mbox | Resend email|
At Fri, 19 Oct 2018 15:52:59 +0300, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote in <3173d989-bc1c-fc8a-3b69-f24246f73876(at)iki(dot)fi>
> Attached is a patch to speed up text_position_setup/next(), in some
> common cases with multibyte encodings.
> text_position_next() uses the Boyer-Moore-Horspool search algorithm,
> with a skip table. Currently, with a multi-byte encoding, we first
> convert both input strings to arrays of wchars, and run the algorithm
> on those arrays.
> This patch removes the mb->wchar conversion, and instead runs the
> single-byte version of the algorithm directly on the inputs, even with
> multi-byte encodings. That avoids the mb->wchar conversion altogether,
> when there is no match. Even when there is a match, we don't need to
> convert the whole input string to wchars. It's enough to count the
> characters up to the match, using pg_mblen(). And when
> text_position_setup/next() are used as part of split_part() or
> replace() functions, we're not actually even interested in the
> character position, so we can skip that too.
Sounds reasonable. Partial matching character is just
ignored. The conversion is simply useless.
> Avoiding the large allocation is nice, too. That was actually why I
> stated to look into this: a customer complained that text_position
> fails with strings larger than 256 MB.
> Using the byte-oriented search on multibyte strings might return false
> positives, though. To make that work, after finding a match, we verify
> that the match doesn't fall in the middle of a multi-byte
> character. However, as an important special case, that cannot happen
> with UTF-8, because it has the property that the byte sequence of one
> character never appears as part of another character. I think some
> other encodings might have the same property, but I wasn't sure.
Yes. That happens for at least EUC_JP, where the same byte can
appear in arbitrary position. Maybe we can hard code to restrict
that to UTF-8 for the first step. (I'm not sure the second step
> This is a win in most cases. One case is slower: calling position()
> with a large haystack input, where the match is near the end of the
> string. Counting the characters up to that point is slower than the
> mb->wchar conversion. I think we could avoid that regression too, if
> we had a more optimized function to count characters. Currently, the
> code calls pg_mblen() in a loop, whereas in pg_mb2wchar_with_len(),
> the similar loop through all the characters is a tight loop within the
Coudn't we count characters while searching a match?
NTT Open Source Software Center
|Next Message||Thomas Munro||2018-10-23 04:26:37||Re: [HACKERS] PATCH: Keep one postmaster monitoring pipe per process|
|Previous Message||Kyotaro HORIGUCHI||2018-10-23 03:54:08||Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2|