Re: PostgreSQL fails to convert decomposed utf-8 to other encodings

From: Tatsuo Ishii <ishii(at)postgresql(dot)org>
To: craig(at)2ndquadrant(dot)com
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-bugs(at)postgresql(dot)org
Subject: Re: PostgreSQL fails to convert decomposed utf-8 to other encodings
Date: 2014-08-06 04:37:28
Message-ID: 20140806.133728.1438896689235492206.t-ishii@sraoss.co.jp
Lists: pgsql-bugs

> On 08/06/2014 09:14 AM, Tom Lane wrote:
>> We don't actually support "decomposed" utf8; if there is any bug here,
>> it's that the input you show isn't rejected. But I think there was
>> some intentional choice to not check \u escapes fully.
>
> Combining characters (i.e. decomposed utf-8 form, for chars where there
> is a combined equivalent) are part of utf-8. They're not an optional add-on.
>
> So if Pg doesn't support them, it doesn't fully support utf-8. Which is
> fine as far as it goes, but must be documented as a limitation at
> minimum. (I'll deal with that).
>
> It also means that you get fun anomalies like:
>
> regress=> SELECT 'á' = 'á';
> ?column?
> ----------
> f
> (1 row)
>
> which is IMO insane.
>
> Not only that, but we can't reject decomposed forms, because they will
> already exist in live installs. That'd break dump and reload of such
> installs and cause exciting problems with pg_upgrade.
>
> The "we'll just reject part of utf-8" opportunity has flown. It needs to
> be documented as a bug in existing versions, and I guess given that I'm
> the one complaining I get to see if I can find a sane fix for 9.5...

I'm not sure what you mean by "decomposed utf8", because there's no
such thing in the Unicode standard. Maybe you mean "composite character"
or "precomposed character"?

Anyway, in my understanding, to handle composite characters we should
do "Unicode normalization" in the first place. There are four
normalization forms:

NFD (Normalization Form Canonical Decomposition)
NFC (Normalization Form Canonical Composition)
NFKD (Normalization Form Compatibility Decomposition)
NFKC (Normalization Form Compatibility Composition)

I don't know how we could implement one of these without major
performance degradation.
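
To illustrate what the four forms do, here is a minimal sketch using
Python's standard unicodedata module. It is only an illustration, not
a proposal for how PostgreSQL would do it; the helper names
(normalized_forms, equal_after) are hypothetical.

```python
import unicodedata

# Two spellings of the same visible character.
precomposed = "\u00e1"   # U+00E1 LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"   # "a" followed by U+0301 COMBINING ACUTE ACCENT


def normalized_forms(text):
    """Return the text in each of the four Unicode normalization forms."""
    return {form: unicodedata.normalize(form, text)
            for form in ("NFD", "NFC", "NFKD", "NFKC")}


def equal_after(form, left, right):
    """Compare two strings after normalizing both to the same form."""
    return (unicodedata.normalize(form, left)
            == unicodedata.normalize(form, right))


# A plain code point comparison sees two different strings.
raw_equal = precomposed == decomposed

# After canonical normalization (either direction) they compare equal.
nfc_equal = equal_after("NFC", precomposed, decomposed)
nfd_equal = equal_after("NFD", precomposed, decomposed)

# The compatibility forms additionally fold characters such as the
# "fi" ligature (U+FB01), which the canonical forms leave alone.
ligature = "\ufb01"
ligature_forms = normalized_forms(ligature)

print(raw_equal, nfc_equal, nfd_equal)
print(ligature_forms["NFC"] == ligature, ligature_forms["NFKC"])
```

Note that every comparison here normalizes both operands first, which
is the extra work I am worried about.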

Also, some composite characters can be decomposed, but when they are
composed again they do not return to the original composite form
(round-trip conversion is impossible). Such characters are listed as
"Composition Exclusions" (see
http://www.unicode.org/Public/UNIDATA/CompositionExclusions.txt).
I have no idea how to deal with this issue.
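
As a concrete case, U+0958 (DEVANAGARI LETTER QA) appears in that
file. The following sketch, again using Python's unicodedata module
purely for illustration (the round_trips helper is hypothetical),
shows that decomposing and then recomposing it does not give back the
original code point, while an ordinary precomposed character does
round-trip.

```python
import unicodedata


def round_trips(char):
    """True if NFD followed by NFC gives back the original string."""
    decomposed = unicodedata.normalize("NFD", char)
    recomposed = unicodedata.normalize("NFC", decomposed)
    return recomposed == char


qa = "\u0958"        # DEVANAGARI LETTER QA, a composition exclusion
a_acute = "\u00e1"   # LATIN SMALL LETTER A WITH ACUTE, not excluded

# NFD splits QA into U+0915 (KA) + U+093C (NUKTA).
qa_nfd = unicodedata.normalize("NFD", qa)

# NFC never recomposes an excluded character, so the pair stays split.
qa_nfc = unicodedata.normalize("NFC", qa)

print([hex(ord(c)) for c in qa_nfd])
print(round_trips(qa), round_trips(a_acute))
```

So after normalizing to NFC, a stored U+0958 comes back as two code
points, and the original spelling is lost.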

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
