Re: Better locale-specific-character-class handling for regexps

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Better locale-specific-character-class handling for regexps
Date: 2016-09-04 21:17:53
Message-ID: 13600.1473023873@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I wrote:
> I got tired of hearing complaints about the issue described in
> this thread:
> https://www.postgresql.org/message-id/flat/24241.1329347196%40sss.pgh.pa.us
> Here's a proposed fix. I've not done extensive performance testing,
> but it seems to be as fast or faster than the old code in cases where
> there are not too many "large" characters in the input. And, more
> to the point, it gets the right answer for such large characters.

I've now done some performance testing on this patch. I ran the
attached scripts in cassert-off builds, using en_US.utf8 locale on
a RHEL6 box. Each number is the best of 3 runs; all times in ms.

The first two test cases are meant to exercise regex compile speed; they
provide relatively short input strings, and each call provides a slightly
different pattern so as to defeat the compiled-pattern cache in
adt/regexp.c.

HEAD patched

compilespeed.sql 61959.793 32936.773
compilespeed09.sql 6411.957 5903.211

compilespeed.sql exercises compiling '\w' which of course is locale
dependent, while compilespeed09.sql compiles '[0-9]' which is not.
Since the patch sets MAX_SIMPLE_CHR to 0x7FF which is the same as
regc_pg_locale.c's old cutoff for how many characters to examine,
we must be doing about the same number of compile-time iswalnum()
probes in both old and new code. The compile speedup must come entirely
from the fact that the new code uses a simple array for the low-char-codes
colormap, whereas the old code has a complicated multilevel colormap that
is more expensive to update. Then the bigger win for compilespeed.sql
must be due to the fact that \w will require changing the colors of more
characters in the colormap. I was not expecting to see such a big win,
but I'll take it ;-)

The remaining cases hold the pattern fixed (so it's not recompiled
each time) and are meant to measure the per-character scan speed
while executing a compiled regexp.

HEAD patched ratio of per-char times

runspeedlo.sql 12951.588 12053.533 0.91
runspeedhi.sql 10410.140 * 36678.876 2.72 (estimated)
runspeednotlo.sql 12999.629 12130.439 0.91
runspeednothi.sql 15678.125 20053.861 1.36
dummy.sql 3444.696 3437.984 (to measure overhead)

* gives wrong answer

In the runspeedhi.sql case, HEAD thinks the first character doesn't match
the pattern, so it's not scanning the whole input, which makes that
measurement not useful. We can assume though that the scan rate would
have been about the same as runspeednothi.sql, and the estimated per-char
ratio shown in the table is computed on that basis.

So the patched code's scan speed is about 9% better than HEAD for data
characters below MAX_SIMPLE_CHR, presumably because of the simpler color
map structure. But it's noticeably worse for data chars above that, and
much worse if the pattern forces runtime iswalnum() tests. That's what
I was expecting.

It's worth noticing that comparing runspeednotlo.sql and runspeednothi.sql
shows that HEAD is 28% worse per-character on the latter. Those cases are
exactly the same so far as HEAD's regex engine is concerned, so I think
that that differential must be blamable entirely on the fact that we're
converting 2-byte UTF8 sequences into pg_wchars in the first case and
3-byte UTF8 sequences in the second case. The patched code is also paying
that cost of course, so you should take it into account when comparing
numbers in different rows of this table. And it's also an indication
that really, this code is pretty frickin fast --- the total time per
character is only circa 4x the extra logic to pick up and shift in
one more byte!

Overall I'm pretty happy with these results. I think that most cases
will be as fast or faster than before, assuming that most data characters
are below MAX_SIMPLE_CHR for most use-cases. In the worst cases, the
per-character cost can be several times what it was before, but those
are exactly the same cases where *we gave the wrong answer* before, so
it's hard to complain too much.

regards, tom lane

Attachment Content-Type Size
compilespeed.sql text/plain 150 bytes
compilespeed09.sql text/plain 167 bytes
runspeedlo.sql text/plain 215 bytes
runspeedhi.sql text/plain 215 bytes
runspeednotlo.sql text/plain 217 bytes
runspeednothi.sql text/plain 217 bytes
dummy.sql text/plain 156 bytes

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Christian Convey 2016-09-04 21:22:38 Obsolete TODO item "-Wcast-align" ?
Previous Message Emre Hasegeli 2016-09-04 17:59:41 Re: [PATCH] Alter or rename enum value