Re: Speeding up text_position_next with multibyte encodings

From: John Naylor <jcnaylor(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speeding up text_position_next with multibyte encodings
Date: 2019-01-15 00:52:42
Message-ID: CAJVSVGWKEShqDeKiQi8jdBBEpXfsM7p7DaML=bEDsK3cPWm=gA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Dec 23, 2018 at 9:33 AM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> So, what is the expected speedup in a "good/average" case? Do we have
> some reasonable real-world workload mixing these cases that could be
> used as a realistic benchmark?

Not sure about a realistic mix, but I investigated the tradeoffs.
First, using the test function Heikki posted upthread [1], I tried a 2
character needle needle with different haystack sizes, and looked for
the position where master and patch were roughly equal (average of 3,
within 1%). Beyond this point, the patch regresses. To keep it simple
I used UTF-8 only.

haystack position
<=23 (patch always faster)
30 23
100 58
1000 ~550
1000000 ~550000

For very short haystacks, the patch is always faster or regresses only
when the needle is close to the end. For longer haystacks, the patch
will be faster if the position searched for is less than ~55% of the
way to the end, slower if after. Next, I tested finding the position
of a 2 character needle in a couple positions towards the front of a
1000 character haystack, plus not present and at the end for
comparison. As seen earlier [1], the worst-case regression for this
haystack length was 4.4x slower for UTF-8, but that had a skip table
collision, and this time I used characters with no bytes in common.
Average of 3, with a loop count of 1,000,000:

UTF-8
pos. master patch diff
* 3880ms 144ms -96%
100 4440ms 1410ms -68%
333 5700ms 4280ms -25%
999 9370ms 12500ms 34%

EUC-KR
pos. master patch diff
* 3900ms 112ms -97%
100 4410ms 1289ms -71%
333 5680ms 3970ms -30%
999 9370ms 11600ms 24%

The patch is much faster than master when the needle is near the front
of a large haystack or not there at all.

The majority of cases are measurably faster, and the best case is at
least 20x faster. On the whole I'd say this patch is a performance win
even without further optimization. I'm marking it ready for committer.

[1] https://www.postgresql.org/message-id/05d95c5c-1fb7-29ea-1c5d-e7e941a0a14d%40iki.fi
--

-John Naylor

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2019-01-15 00:55:45 Re: "SELECT ... FROM DUAL" is not quite as silly as it appears
Previous Message Tsunakawa, Takayuki 2019-01-15 00:50:23 RE: O_DIRECT for relations and SLRUs (Prototype)