Re: Speeding up text_position_next with multibyte encodings

From: John Naylor <jcnaylor(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speeding up text_position_next with multibyte encodings
Date: 2018-12-14 18:20:46
Message-ID: CAJVSVGUVe9G6yUb3F-YrC_myrj_nqhagYrpmP3OT8sUjzvAuaA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 11/30/18, Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
> Unfortunately, patch doesn't compile anymore due:
>
> varlena.c: In function ‘text_position_next_internal’:
> varlena.c:1337:13: error: ‘start_ptr’ undeclared (first use in this
> function)
> Assert(start_ptr >= haystack && start_ptr <= haystack_end);
>
> Could you please send an updated version? For now I'm moving it to the next
> CF.

I signed up to be a reviewer, and I will be busy next month, so I went
ahead and fixed the typo in the patch that broke assert-enabled
builds. While at it, I standardized on the spelling "start_ptr" in a
few places to match the rest of the file. It's a bit concerning that
it wouldn't compile with asserts, but the patch was written by a
committer and seems to work.

On 10/19/18, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> This is a win in most cases. One case is slower: calling position() with
> a large haystack input, where the match is near the end of the string.

I wanted to test this case in detail, so I ran the attached script,
which runs position() in three scenarios:
short: 1 character needle at the end of a ~1000 character haystack,
repeated 1 million times
medium: 50 character needle at the end of a ~1 million character
haystack, repeated 1 million times
long: 250 character needle at the end of an ~80 million character
haystack (~230MB, comfortably below the 256MB limit Heikki reported),
done just one time

I took the average of 5 runs using Korean characters in both UTF-8
(3-byte) and EUC-KR (2-byte) encodings.

UTF-8
length master patch
short 2.26s 2.51s
med 2.28s 2.54s
long 3.07s 3.11s

EUC-KR
length master patch
short 2.29s 2.37s
med 2.29s 2.36s
long 1.75s 1.71s

With UTF-8, the patch is 11-12% slower on short and medium strings,
and about the same on long strings. With EUC-KR, the patch is about 3%
slower on short and medium strings, and 2-3% faster on long strings.
It seems the worst case is not that bad, and could be optimized, as
Heikki said.

-John Naylor

Attachment Content-Type Size
v2-0001-Use-single-byte-Boyer-Moore-Horspool-search-even-.patch text/x-patch 24.6 KB
single-byte-bmh.sql application/sql 569 bytes

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2018-12-14 18:25:29 Re: Ryu floating point output patch
Previous Message Robert Haas 2018-12-14 18:16:30 Re: removal of dangling temp tables