Fix quadratic performance of regexp match/split functions

From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Fix quadratic performance of regexp match/split functions
Date: 2018-08-13 03:32:17
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

While poking at that xml issue I tried to compare the memory usage of
the xml queries with vaguely comparable regexp_split_to_table queries,
and found that I could not do so; the regexp queries simply did not
complete in any sensible timeframe.

Investigation showed that in UTF8 (or other multibyte encodings), the
performance of regexp_matches and regexp_split_to_* is subject to
quadratic slowdowns thanks to the use of character offsets and calls to
text_substr, which must scan the string from the start each time to
count characters.

This patch changes that by keeping the wide-char copy of the string
available rather than freeing it, and converting the required substrings
back to the db encoding via a pre-allocated conversion buffer.

This gives noticable speedups on even very small strings (all timings
below are with -O2 --disable-cassert):

select count(*)
from (select 'aaaaa,aaaaa,aaaaa'::text as content
from generate_series(1,1000000) i offset 0) s,
regexp_split_to_table(content, ',') r;

-- ~10% faster with patch: 2.8 s -> 2.5 s

but on strings of even modest size (~1kb) the improvement is vast:

select count(*)
from (select repeat(repeat('a',10) || ',', 100+(i*0))::text
as content -- 1100 bytes, 101 matches
from generate_series(1,100000) i offset 0) s,
regexp_split_to_table(content, ',') r;

-- over 8 times faster: 51.8 sec -> 6.3 sec

and it only gets bigger:

select count(*)
from regexp_split_to_table(repeat('aaa,',10000), ',') r; -- 40KB

-- 270 times faster: 1628ms -> 6ms

This patch passes regression but I haven't yet tested it on complex
multibyte data (or non-UTF8 encodings but that shouldn't matter).


Andrew (irc:RhodiumToad)

Attachment Content-Type Size
qregex.patch text/x-patch 8.4 KB


Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2018-08-13 03:52:20 Re: Explain buffers wrong counter with parallel plans
Previous Message Robert Haas 2018-08-13 03:03:31 Re: Allowing printf("%m") only where it actually works