Re: Unicode normalization SQL functions

From: "Daniel Verite" <daniel(at)manitou-mail(dot)org>
To: "Peter Eisentraut" <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: "Andreas Karlsson" <andreas(at)proxel(dot)se>,"pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unicode normalization SQL functions
Date: 2020-02-17 19:08:00
Message-ID: 2c5e8df9-43b8-41fa-88e6-286e8634f00a@manitou-mail.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

I've checked the v3 patch against the results of the normalization
done by ICU [1] on my test data again, and they're identical
(as they were with v1; v2 had the bug discussed upthread, now fixed).

Concerning execution speed, there's an excessive CPU usage when
normalizing into NFC or NFKC. Looking at pre-existing code, it looks
like recompose_code() in unicode_norm.c looping over the
UnicodeDecompMain array might be very costly.

Another point is that the ICU-based implementation appears
to be significantly faster in all cases, which makes me wonder
why ICU builds should not just use ICU instead of the PG-core
implementation.
To illustrate this, here are the execution times reported by psql for
the queries below exercising the normalization code, both with the
functions provided by the patch and with the equivalent functions
implemented with ICU.
The dataset is ~10 million unique short strings
extracted from real data, and the number is a median execution time in
millisecs, for 10 successive runs with query parallelism off
(stddev in parentheses).

operation | core | icu
------------+--------------+-----------
nfc check | 4398 (20) | 3088 (27)
nfc conv | 771502 (414) | 5503 (19)
nfd check | 4510 (10) | 2898 (8)
nfd conv | 9102 (1) | 5569 (6)
nfkc check | 4825 (51) | 3273 (4)
nfkc conv | 772240 (340) | 5763 (8)
nfkd check | 4794 (4) | 3170 (39)
nfkd conv | 9229 (4) | 5824 (9)

The queries:

check w/core:
select count(*) from words where w is $NORM normalized;

conversion w/core:
select sum(length(normalize(w, $NORM))) from words;

check w/icu:
select count(*) from words where icu_is_normalized(w, '$NORM');

conversion w/icu:
select sum(length(icu_normalize(w, '$NORM'))) from words;

[1] https://github.com/dverite/icu_ext/blob/master/icu_normalize.c

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Verite 2020-02-17 19:14:03 Re: Unicode normalization SQL functions
Previous Message Haumacher, Bernhard 2020-02-17 18:01:38 Re: Error on failed COMMIT