May "PostgreSQL server side GB18030 character set support" reconsidered?

From: Han Parker <parker(dot)han(at)outlook(dot)com>
To: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: May "PostgreSQL server side GB18030 character set support" reconsidered?
Date: 2020-10-05 05:14:58
Message-ID: ME2PR01MB2532E72B514DC46ED0E10F798A0C0@ME2PR01MB2532.ausprd01.prod.outlook.com
Lists: pgsql-general

Hi,

Could "GB18030 server-side support" be reconsidered, now that about 15 years have passed since the release of GB18030-2005?
It may be one of the greenest features PostgreSQL could offer.

1. In this era of big data and mobile devices, in the country with the largest population, storing Chinese characters in UTF-8 takes about 50% more disk space, and therefore more energy (UTF-8 usually needs 3 bytes per Chinese character, while GB18030 needs only 2 for the common ones). That works against "carbon neutral" goals and adds to polar ice melting.
https://www.nasa.gov/feature/goddard/2020/emissions-could-add-15-inches-to-2100-sea-level-rise-nasa-led-study-finds
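
The per-character sizes behind this estimate can be checked with a short Python sketch. It uses only Python's built-in `utf-8` and `gb18030` codecs; the helper name `encoded_sizes` and the sample string are illustrative assumptions and have nothing to do with PostgreSQL itself.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded byte length of `text` in UTF-8 and in GB18030."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "gb18030": len(text.encode("gb18030")),
    }


# Hypothetical sample: five common Chinese characters.
sample = "中文数据库"
sizes = encoded_sizes(sample)
print(sizes)

# Ratio of UTF-8 size to GB18030 size for this sample.
ratio = sizes["utf-8"] / sizes["gb18030"]
print(f"UTF-8 is {ratio:.2f}x the size of GB18030 for this sample")
```

The ratio only holds for text dominated by common Chinese characters: ASCII takes one byte in both encodings, and characters outside GB18030's two-byte range take four bytes there.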

2. The suggestion in the mail below, "set the client side to UTF-8, just as the server side is set to UTF-8", is not practical for most Chinese IT projects, especially publicly funded ones, because GB18030 compatibility is required by law in Mainland China.
A client-side encoding setting exposed through a GUI is usually harder to hide, and most MS Windows users are familiar with GB18030.
MySQL has supported GB18030 on the server side since version 5.7, released in 2015. I am not sure how much this feature has contributed to MySQL's greater popularity in Mainland China.
https://dev.mysql.com/doc/mysql-g11n-excerpt/5.7/en/charset-gb18030.html


Parker Han

________________________________
From: pgsql-general-owner(at)postgresql(dot)org <pgsql-general-owner(at)postgresql(dot)org> on behalf of Arjen Nienhuis <a(dot)g(dot)nienhuis(at)gmail(dot)com>
Sent: Saturday, March 7, 2015 8:18
To: lsliang <lsliang(at)pconline(dot)com(dot)cn>
Cc: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>; pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Re: Re: [GENERAL] can postgresql supported utf8mb4 character sets?

On Fri, Mar 6, 2015 at 3:55 AM, lsliang <lsliang(at)pconline(dot)com(dot)cn> wrote:

2015-03-06

________________________________
From: Adrian Klaver
Sent: 2015-03-05 21:31:39
To: lsliang; pgsql-general
Cc:
Subject: Re: [GENERAL] can postgresql supported utf8mb4 character sets?

On 03/05/2015 01:45 AM, lsliang wrote:
> can postgresql supported utf8mb4 character set?
> today mobile apps support 4-byte character and utf8 can only
> support 1-3 bytes character
The docs would seem to indicate otherwise:
http://www.postgresql.org/docs/9.3/interactive/multibyte.html
http://en.wikipedia.org/wiki/UTF-8
> if load string to database which contain a 4-byte character
> will failed .
Have you actually tried to load strings in to Postgres?
If so and it failed what was the method you used and what was the error?
> mysql since 5.5.3 support utf8mb4 character sets
> I don't find some information about postgresql
> thanks
--
Adrian Klaver
adrian(dot)klaver(at)aklaver(dot)com

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
thanks for your help .

postgresql can support 4-byte character

test=> select * from utf8mb4_test ;
ERROR: character with byte sequence 0xf0 0x9f 0x98 0x84 in encoding "UTF8" has no equivalent in encoding "GB18030"
test=> \encoding utf8
test=> select * from utf8mb4_test ;
content
---------
😄
😄

pcauto=>

UTF-8 support works fine. The 3-byte limit was something MySQL invented. But it only works if your client encoding is UTF-8. In your example, your terminal is not set to UTF-8.

create table test (glyph text);
insert into test values ('A'), ('馬'), ('𐁀'), ('😄'), ('🇪🇸');

select glyph, convert_to(glyph, 'utf-8'), length(glyph) FROM test;
glyph | convert_to | length
-------+--------------------+--------
A | \x41 | 1
馬 | \xe9a6ac | 1
𐁀 | \xf0908180 | 1
😄 | \xf09f9884 | 1
🇪🇸 | \xf09f87aaf09f87b8 | 2
(5 rows)

What doesn't work is GB18030:

select glyph, convert_to(glyph, 'GB18030'), length(glyph) FROM test;
ERROR: character with byte sequence 0xf0 0x90 0x81 0x80 in encoding "UTF8" has no equivalent in encoding "GB18030"

I think that is a bug.
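
As a cross-check outside PostgreSQL, the same five glyphs can be run through Python's built-in `gb18030` codec. This is only a sketch: it shows that one independent implementation of the encoding round-trips these characters, and it says nothing about how PostgreSQL's own conversion tables are built. The names `glyphs` and `results` are illustrative.

```python
# The same sample values as in the test table above.
glyphs = ["A", "馬", "𐁀", "😄", "🇪🇸"]

results = {}
for glyph in glyphs:
    encoded = glyph.encode("gb18030")
    # Decoding must give back the original string for a lossless mapping.
    assert encoded.decode("gb18030") == glyph
    # Record the number of code points and the number of GB18030 bytes.
    results[glyph] = (len(glyph), len(encoded))
    print(glyph, len(glyph), encoded.hex())
```

In this codec the characters beyond the Basic Multilingual Plane come out as four bytes per code point, so the flag, being two code points, takes eight bytes.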

Gr. Arjen
