Re: How well does PostgreSQL 9.6.1 support unicode?

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: james(at)360data(dot)ca
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: How well does PostgreSQL 9.6.1 support unicode?
Date: 2016-12-21 07:56:37
Message-ID: 20161221.165637.246733544.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-general

Hello,

At Tue, 20 Dec 2016 16:41:51 -0800, James Zhou <james(at)360data(dot)ca> wrote in <CAGuREpPHJmoHe_5+P25UCosRvqQpbhPF_0LGFbJ+xYgUKndydg(at)mail(dot)gmail(dot)com>
> Unicode has evolved from version 1.0 with 7,161 characters released in 1991
> to version 9.0 with 128,172 characters released in June 2016. My questions
> are
> - which version of Unicode is supported by PostgreSQL 9.6.1?
> - what does "supported" exactly mean? simply store it? comparison? sorting?
> substring? etc.
...
> /* characters from BMP, 0000 - FFFF */
> insert into unicode(id, string) values(1, U&'\0041'); -- 'A'
...
> insert into unicode(id, string) values(5, U&'\6211\4EEC'); -- a string of two Chinese characters

These shouldn't be a problem.

> /* Below are unicode characters with code points beyond FFFF, aka planes 1 - F */
> insert into unicode(id, string) values(100, U&'\1F478'); -- a mojo character, https://unicodelookup.com/#0x1f478/1

https://www.postgresql.org/docs/9.6/static/sql-syntax-lexical.html

> Unicode characters can be specified in escaped form by writing a
> backslash followed by the four-digit hexadecimal code point
> number or alternatively a backslash followed by a plus sign
> followed by a six-digit hexadecimal code point number.

So this is parsed as U+1F47 followed by the character '8', as you
have seen. It should be written as follows; the '+' is needed just
after the backslash.

insert into unicode(id, string) values(100, U&'\+01F478');

The six-digit form accepts code points up to U+10FFFF, so the whole
Unicode code space is usable.
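
The difference between the two escape forms can be checked directly.
The following is only a minimal sketch, assuming a database whose
server encoding is UTF8 and standard_conforming_strings = on (the
default); the column aliases are purely illustrative.

```sql
-- Four-digit form: '\1F47' is consumed as one code point and the
-- trailing '8' is an ordinary literal character.
SELECT char_length(U&'\1F478')   AS four_digit_len,  -- 2 characters
       char_length(U&'\+01F478') AS six_digit_len;   -- 1 character

-- A UTF-16 surrogate pair written with two four-digit escapes
-- denotes the same character as the six-digit form.
SELECT U&'\D83D\DC78' = U&'\+01F478' AS same_character;
```

The surrogate-pair spelling is accepted as well, but the six-digit
form makes it unnecessary.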

> Observations
>
> - BMP characters (id <= 10)
> - they are stored and fetched correctly.
> - their lengths in char are correct, although some of them take 3
> bytes (id = 4, 6, 7)
> - *But their sorting order seems to be undefined. Can anyone comment
> the sorting rules?*
> - Non-BMP characters (id >= 100)
> - they take 2 - 4 bytes.
> - Their lengths in character are not correct
> - they are not retrieved correctly, judged by the their fetched ascii
> value (column 5 in the table above)
> - substring is not correct

>
> Specifically, the lack of support for emojo characters 0x1F478, 0x1F479 is
> causing a problem in my application.

Using the '\+' six-digit form would resolve the problem.

> My conclusion:
> - PostgreSQL 9.6.1 only supports a subset of unicode characters in BMP. Is
> there any documents defining which subset is fully supported?

A PostgreSQL database with encoding=UTF8 simply accepts the whole
range of Unicode, regardless of whether a character is defined for
a given code point.
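
To confirm how a non-BMP character behaves once it is written with
the six-digit escape, something like the following could be run. It
is a sketch that assumes a UTF8 database; the column aliases and the
test string 'A...B' are made up for illustration.

```sql
SHOW server_encoding;  -- the checks below assume this reports UTF8

SELECT char_length(U&'\+01F478')  AS chars,       -- 1
       octet_length(U&'\+01F478') AS bytes,       -- 4 in UTF8
       ascii(U&'\+01F478')        AS code_point,  -- 128120, i.e. 0x1F478
       -- the second character of 'A<U+1F478>B' is the non-BMP character
       substring(U&'A\+01F478B' from 2 for 1) = U&'\+01F478' AS substring_ok;
```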

> Are any configuration I can change so that more unicode characters are
> supported?

As for sorting, that is covered in Tom's mail.

--
Kyotaro Horiguchi
NTT Open Source Software Center
