Re: Unicode normalization SQL functions

From: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
To: Daniel Verite <daniel(at)manitou-mail(dot)org>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unicode normalization SQL functions
Date: 2020-01-09 09:20:14
Message-ID: 2309023a-6f69-f049-70e5-3c70b4fb9672@2ndquadrant.com
Lists: pgsql-hackers

On 2020-01-06 17:00, Daniel Verite wrote:
> Peter Eisentraut wrote:
>
>> Also, there is a way to optimize the "is normalized" test for common
>> cases, described in UTR #15. For that we'll need an additional data
>> file from Unicode. In order to simplify that, I would like my patch
>> "Add support for automatically updating Unicode derived files"
>> integrated first.
>
> Would that explain that the NFC/NFKC normalization and "is normalized"
> check seem abnormally slow with the current patch, or should
> it be regarded independently of the other patch?

That's unrelated.

> For instance, testing 10000 short ASCII strings:
>
> postgres=# select count(*) from (select md5(i::text) as t from
> generate_series(1,10000) as i) s where t is nfc normalized ;
> count
> -------
> 10000
> (1 row)
>
> Time: 2573,859 ms (00:02,574)
>
> By comparison, the NFD/NFKD case is faster by two orders of magnitude:
>
> postgres=# select count(*) from (select md5(i::text) as t from
> generate_series(1,10000) as i) s where t is nfd normalized ;
> count
> -------
> 10000
> (1 row)
>
> Time: 29,962 ms
>
> Although NFC/NFKC has a recomposition step that NFD/NFKD
> doesn't have, such a difference is surprising.

It's very likely that this is because the recomposition calls
recompose_code() which does a sequential scan of UnicodeDecompMain for
each character. To optimize that, we should probably build a bespoke
reverse mapping table that can be accessed more efficiently.
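
As a rough sketch of what such a reverse mapping could look like: keep
(first, second) -> composed triples sorted by the pair and find them with
a binary search instead of scanning the decomposition table. The names
recomp_entry, recomp_table and recompose_lookup are made up for
illustration and are not in the patch, and the table is only a tiny
hand-picked subset of the canonical compositions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical reverse-mapping entry: (first, second) -> composed. */
typedef struct
{
	uint32_t	first;		/* starter code point */
	uint32_t	second;		/* combining code point */
	uint32_t	composed;	/* precomposed code point */
} recomp_entry;

/* Illustrative subset only; must stay sorted by (first, second). */
static const recomp_entry recomp_table[] = {
	{0x0041, 0x0300, 0x00C0},	/* A + grave      */
	{0x0041, 0x0301, 0x00C1},	/* A + acute      */
	{0x0045, 0x0301, 0x00C9},	/* E + acute      */
	{0x0061, 0x0308, 0x00E4},	/* a + diaeresis  */
	{0x0065, 0x0301, 0x00E9},	/* e + acute      */
	{0x006F, 0x0302, 0x00F4},	/* o + circumflex */
};

#define RECOMP_TABLE_LEN (sizeof(recomp_table) / sizeof(recomp_table[0]))

/* bsearch() comparator: orders entries by first, then by second. */
static int
recomp_cmp(const void *key, const void *elem)
{
	const recomp_entry *k = key;
	const recomp_entry *e = elem;

	if (k->first != e->first)
		return (k->first < e->first) ? -1 : 1;
	if (k->second != e->second)
		return (k->second < e->second) ? -1 : 1;
	return 0;
}

/*
 * Look up the composition of (first, second) in O(log n).
 * Returns true and sets *result if the pair composes, false otherwise.
 */
static bool
recompose_lookup(uint32_t first, uint32_t second, uint32_t *result)
{
	recomp_entry key = {first, second, 0};
	const recomp_entry *hit;

	assert(result != NULL);

	hit = bsearch(&key, recomp_table, RECOMP_TABLE_LEN,
				  sizeof(recomp_entry), recomp_cmp);
	if (hit == NULL)
		return false;

	*result = hit->composed;
	return true;
}
```

A real table would presumably be generated from the Unicode data files
along with the existing decomposition table, and a hash or a two-level
index might turn out to be a better fit than a plain binary search.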

> I've tried an alternative implementation based on ICU's
> unorm2_isNormalized() /unorm2_normalize() functions (which I'm
> currently adding to the icu_ext extension to be exposed in SQL).
> With these, the 4 normal forms are in the 20ms ballpark with the above
> test case, without a clear difference between composed and decomposed
> forms.

That's good feedback.

> Independently of the performance, I've compared the results
> of the ICU implementation vs this patch on large series of strings
> with all normal forms and could not find any difference.

And that too.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
