speed up unicode normalization quick check

From: John Naylor <john(dot)naylor(at)2ndquadrant(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: speed up unicode normalization quick check
Date: 2020-05-21 07:12:06
Message-ID: CACPNZCt4fbJ0_bGrN5QPt34N4whv=mszM0LMVQdoa2rC9UMRXA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Attached is a patch to use perfect hashing to speed up Unicode
normalization quick check.

0001 changes the set of multipliers attempted when generating the hash
function. The set in HEAD works for the current set of NFC codepoints,
but not for the other types. Also, the updated multipliers now all
compile to shift-and-add on most platform/compiler combinations
available on godbolt.org (earlier experiments found in [1]). The
existing keyword lists are fine with the new set, and don't seem to be
very picky in general. As a test, it also successfully finds a
function for the OS "words" file, the "D" sets of codepoints, and for
sets of the first n built-in OIDs, where n > 5.

0002 builds on top of the existing normprops infrastructure to use a
hash function for NFC quick check. Below are typical numbers in a
non-assert build:

select count(*) from (select md5(i::text) as t from
generate_series(1,100000) as i) s where t is nfc normalized;

HEAD 411ms 413ms 409ms
patch 296ms 297ms 299ms

The addition of "const" was to silence a compiler warning. Also, I
changed the formatting of the output file slightly to match pgindent.

0003 uses hashing for NFKC and removes binary search. This is split
out for readability. I gather NFKC is a less common use case, so this
could technically be left out. Since this set is larger, the
performance gains are a bit larger as well, at the cost of 19kB of
binary space:

HEAD 439ms 440ms 442ms
patch 299ms 301ms 301ms

I'll add this to the July commitfest.

[1] https://www.postgresql.org/message-id/CACPNZCuVTiLhxAzXp9uCeHGUyHMa59h6_pmP+_W-SzXG0UyY9w@mail.gmail.com

--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
v1-0001-Tweak-the-set-of-candidate-multipliers-for-genera.patch application/octet-stream 1.4 KB
v1-0002-Use-perfect-hashing-for-NFC-Unicode-normalization.patch application/octet-stream 21.0 KB
v1-0003-Use-perfect-hashing-for-NFKC-Unicode-normalizatio.patch application/octet-stream 63.4 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2020-05-21 07:17:16 Re: Planning counters in pg_stat_statements (using pgss_store)
Previous Message Amit Kapila 2020-05-21 06:53:56 Re: WIP/PoC for parallel backup