Unicode normalization SQL functions

From: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Unicode normalization SQL functions
Date: 2019-12-12 11:46:21
Message-ID: c1909f27-c269-2ed9-12f8-3ab72c8caf7a@2ndquadrant.com
Lists: pgsql-hackers

Here are patches to add support for Unicode normalization to SQL, per
the SQL standard:

normalize($string [, form])
$string is [form] normalized

(comment about silly SQL syntax here)

We already have all the infrastructure for Unicode normalization for the
SASLprep functionality. The first patch extends the internal APIs to
support all four normal forms instead of only NFKC used by SASLprep.
The second patch adds the SQL layer on top of it.
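
For readers unfamiliar with the four forms, here is a minimal sketch of how
they differ on one sample string. It uses Python's unicodedata module purely
as a stand-in; it is not the backend code from the patches, and the sample
string and the codepoints() helper are made up for illustration.

```python
import unicodedata

# Sample: the "fi" ligature (U+FB01) followed by precomposed "e with acute" (U+00E9).
sample = "\ufb01\u00e9"

def codepoints(s):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)

# Normalize the same input into each of the four normal forms.
forms = {form: codepoints(unicodedata.normalize(form, sample))
         for form in ("NFC", "NFD", "NFKC", "NFKD")}

for form, cps in forms.items():
    print(f"{form:4} {cps}")
```

The canonical forms (NFC, NFD) leave the ligature alone and only compose or
decompose the accented letter; the compatibility forms (NFKC, NFKD) also
replace the ligature with plain "f" and "i".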

This could be used to preprocess or check strings before using them with
deterministic collations or locale implementations that don't deal with
non-NFC data correctly, perhaps using triggers, generated columns, or
domains. The NFKC and NFKD normalizations could also be used for
general data cleaning, similar to what SASLprep does.
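
A small sketch of the problem this addresses, again using Python's
unicodedata module as a stand-in rather than the proposed SQL functions
(is_normalized() needs Python 3.8 or later; the sample strings are
illustrative assumptions):

```python
import unicodedata

composed = "caf\u00e9"      # "e with acute" as a single code point
decomposed = "cafe\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# A code-point-wise comparison treats the two spellings as different strings.
equal_raw = composed == decomposed

# After normalizing both sides to NFC they compare equal.
equal_nfc = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))

# Checking without rewriting, roughly what a constraint or domain would do.
composed_ok = unicodedata.is_normalized("NFC", composed)
decomposed_ok = unicodedata.is_normalized("NFC", decomposed)

# NFKC additionally folds compatibility characters, here fullwidth Latin letters.
cleaned = unicodedata.normalize("NFKC", "\uff21\uff22\uff23")

print(equal_raw, equal_nfc, composed_ok, decomposed_ok, cleaned)
```

Normalizing on input (for example in a trigger or generated column) makes the
two spellings compare equal, while the check variant only rejects or flags
non-normalized data.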

As a future idea, I think we could also hook Unicode normalization into
the protocol-level encoding conversion.

Also, there is a way to optimize the "is normalized" test for common
cases, described in UAX #15 (formerly UTR #15). For that we'll need an
additional data file from Unicode. In order to simplify that, I would like my patch
"Add support for automatically updating Unicode derived files"
integrated first.
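
To give an idea of what such a quick check looks like, here is a rough sketch
for the NFD case only, written in Python as a stand-in for the C code. It is
an assumption-laden illustration, not the patch: the function names are
hypothetical, and the "cannot occur in NFD" property is derived here from the
decomposition data in unicodedata instead of being read from the Unicode data
file mentioned above. NFD is the simple case because the property is only
ever yes or no; the composed forms also have a "maybe" answer that requires
falling back to full normalization.

```python
import unicodedata

def nfd_quick_check_no(ch):
    """True if ch can never appear in NFD text (derived stand-in for NFD_QC=No)."""
    if 0xAC00 <= ord(ch) <= 0xD7A3:          # precomposed Hangul syllables
        return True
    decomp = unicodedata.decomposition(ch)
    # A non-empty mapping without a "<tag>" prefix is a canonical decomposition.
    return bool(decomp) and not decomp.startswith("<")

def is_nfd_quick(s):
    """Single pass over s, without building the normalized string."""
    last_ccc = 0
    for ch in s:
        ccc = unicodedata.combining(ch)      # canonical combining class
        # Combining marks must appear in non-decreasing class order.
        if ccc != 0 and last_ccc > ccc:
            return False
        if nfd_quick_check_no(ch):
            return False
        last_ccc = ccc
    return True

for text in ("cafe\u0301", "caf\u00e9", "a\u0323\u0301", "a\u0301\u0323"):
    print(ascii(text), is_nfd_quick(text))
```

The point of the optimization is that the common case (plain, already
normalized text) is answered in one scan over the input, without allocating
and comparing a normalized copy.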

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment                                                        Content-Type  Size
v1-0001-Add-support-for-other-normal-forms-to-Unicode-nor.patch  text/plain    369.9 KB
v1-0002-Add-SQL-functions-for-Unicode-normalization.patch        text/plain    18.3 KB
