Re: Speeding up text_position_next with multibyte encodings

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: hlinnaka(at)iki(dot)fi
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Speeding up text_position_next with multibyte encodings
Date: 2018-10-23 04:19:29
Message-ID: 20181023.131929.125357849.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-hackers

Hello.

At Fri, 19 Oct 2018 15:52:59 +0300, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote in <3173d989-bc1c-fc8a-3b69-f24246f73876(at)iki(dot)fi>
> Attached is a patch to speed up text_position_setup/next(), in some
> common cases with multibyte encodings.
>
> text_position_next() uses the Boyer-Moore-Horspool search algorithm,
> with a skip table. Currently, with a multi-byte encoding, we first
> convert both input strings to arrays of wchars, and run the algorithm
> on those arrays.
>
> This patch removes the mb->wchar conversion, and instead runs the
> single-byte version of the algorithm directly on the inputs, even with
> multi-byte encodings. That avoids the mb->wchar conversion altogether,
> when there is no match. Even when there is a match, we don't need to
> convert the whole input string to wchars. It's enough to count the
> characters up to the match, using pg_mblen(). And when
> text_position_setup/next() are used as part of split_part() or
> replace() functions, we're not actually even interested in the
> character position, so we can skip that too.

Sounds reasonable. A partially matching character is simply
ignored, so the conversion is indeed useless.
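For reference, the post-match counting could look roughly like this
(just a sketch; the helper name is mine, not from the patch):

    /*
     * Count the characters preceding a known byte offset.  Only needed
     * once a match has actually been found.
     */
    static int
    count_chars_up_to(const char *str, int byte_offset)
    {
        int         charpos = 0;
        int         i = 0;

        while (i < byte_offset)
        {
            i += pg_mblen(str + i);     /* advance one character */
            charpos++;
        }
        return charpos;
    }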

> Avoiding the large allocation is nice, too. That was actually why I
> started to look into this: a customer complained that text_position
> fails with strings larger than 256 MB.
>
> Using the byte-oriented search on multibyte strings might return false
> positives, though. To make that work, after finding a match, we verify
> that the match doesn't fall in the middle of a multi-byte
> character. However, as an important special case, that cannot happen
> with UTF-8, because it has the property that the byte sequence of one
> character never appears as part of another character. I think some
> other encodings might have the same property, but I wasn't sure.

Yes. False matches can happen at least with EUC_JP, where the same
byte values can appear at arbitrary positions within a character.
Maybe we can hard-code the optimization to UTF-8 only as a first
step. (I'm not sure a second step will ever happen.)
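For such encodings, the verification could be a simple walk from a
known character boundary (a rough sketch; the function name is mine):

    /*
     * Return true if byte offset "pos" in "str" starts a character.
     * Walks forward with pg_mblen(); landing past "pos" means the
     * candidate match would split a multibyte character.
     */
    static bool
    offset_is_char_boundary(const char *str, int pos)
    {
        int         i = 0;

        while (i < pos)
            i += pg_mblen(str + i);

        return i == pos;
    }

In practice we would resume from the previously verified offset
instead of the string start, to avoid rescanning the haystack.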

> This is a win in most cases. One case is slower: calling position()
> with a large haystack input, where the match is near the end of the
> string. Counting the characters up to that point is slower than the
> mb->wchar conversion. I think we could avoid that regression too, if
> we had a more optimized function to count characters. Currently, the
> code calls pg_mblen() in a loop, whereas in pg_mb2wchar_with_len(),
> the similar loop through all the characters is a tight loop within the
> function.
>
> Thoughts?

Couldn't we count the characters while searching for a match?
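For UTF-8 at least, that seems cheap: count lead bytes (anything that
is not a 10xxxxxx continuation byte) as the scan advances. A sketch,
not tested:

    /* Count UTF-8 characters in the first "nbytes" bytes of "s". */
    static int
    utf8_count_chars(const unsigned char *s, int nbytes)
    {
        int         count = 0;
        int         i;

        for (i = 0; i < nbytes; i++)
        {
            if ((s[i] & 0xC0) != 0x80)
                count++;
        }
        return count;
    }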

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
