Re: Pre-proposal: unicode normalized text

From: Nico Williams <nico(at)cryptonector(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Pre-proposal: unicode normalized text
Date: 2023-10-04 17:23:41
Message-ID: ZR2fnTPDwPb0IamC@ubby21
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Sep 12, 2023 at 03:47:10PM -0700, Jeff Davis wrote:
> The idea is to have a new data type, say "UTEXT", that normalizes the
> input so that it can have an improved notion of equality while still
> using memcmp().

A UTEXT type would be helpful for specifying that the text must be
Unicode (in which transform?) even if the character data encoding for
the database is not UTF-8.

Maybe UTF8 might be a better name for the new type, since it would
denote the transform (and would allow for UTF16 and UTF32 some day,
though it's doubtful those would ever happen).

But it's one thing to specify Unicode (and transform) in the type and
another to specify an NF to normalize to on insert or on lookup.

How about new column constraint keywords, such as NORMALIZE (meaning
normalize on insert) and NORMALIZED (meaning reject non-canonical form
text), with an optional parenthetical by which to specify a non-default
form? (These would apply to TEXT as well when the default encoding for
the DB is UTF-8.)

One could then ALTER TABLE to add this to existing tables.

This would also make it easier to add a form-preserving/form-insensitive
mode later if it turns out to be useful or necessary, maybe making it
the default for Unicode text in new tables.

> Questions:
>
> * Would this be useful enough to justify a new data type? Would it be
> confusing about when to choose one versus the other?

Yes. See above. I think I'd rather have it be called UTF8, and the
normalization properties of it to be specified as column constraints.

> * Would cross-type comparisons between TEXT and UTEXT become a major
> problem that would reduce the utility?

Maybe when the database's encoding is UTF_8 then UTEXT (or UTF8) can be an alias
of TEXT.

> * Should "some_utext_value = some_text_value" coerce the LHS to TEXT
> or the RHS to UTEXT?

Ooh, this is nice! If the TEXT is _not_ UTF-8 then it could be
converted to UTF-8. So I think which is RHS and which is LHS doesn't
matter -- it's which is UTF-8, and if both are then the only thing left
to do is normalize, and for that I'd take the LHS' form if the LHS is
UTF-8, else the RHS'.

Nico
--

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2023-10-04 17:47:40 Re: Pre-proposal: unicode normalized text
Previous Message Robert Haas 2023-10-04 17:16:22 Re: Pre-proposal: unicode normalized text