| From: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
|---|---|
| To: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Cc: | Tatsuo Ishii <ishii(at)postgresql(dot)org>, Henson Choi <assam258(at)gmail(dot)com> |
| Subject: | Experimenting with wider Unicode storage |
| Date: | 2026-04-15 13:43:26 |
| Message-ID: | CA+hUKG+VEg7OsbRNbRcakp2k+078PCDhZ6HUJjvGvJ839ivxDQ@mail.gmail.com |
| Lists: | pgsql-hackers |
Hi,
We only allow one character encoding per database. The SQL standard
and most comparable RDBMSs are more flexible, though the details vary.
In most, you can attach CHARACTER SET to column definitions, a whole
CREATE TABLE (non-standard), CREATE DOMAIN, CREATE SCHEMA and probably
more places. In some, encoding is implied by COLLATE instead
(non-standard). When different encodings meet, text is transcoded,
which works but is bad for performance.
Our single database encoding also has some restrictions: it must
encode ASCII as ASCII, and any byte that is part of a multibyte
sequence must not look like ASCII, since many code paths require that
and fixing that is hard. That excludes a few encodings that people
want.
That's mostly OK these days, as modern information systems use Unicode
everywhere. There is one practical problem that causes people to
complain[2] about PostgreSQL though: about half of the global
population uses a language that arbitrarily gained a byte per
character by switching to UTF-8 compared to various legacy encodings
or GB18030. That's storage and RAM that you have to pay for up front
and forever (space, I/O[3]).
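The extra byte per character is easy to see directly. As a quick illustration (not from the patch), Han ideographs take three bytes each in UTF-8 but only two in UTF-16:

```python
# Han ideographs: 3 bytes per character in UTF-8, 2 in UTF-16.
s = "霍比特人"  # "hobbit" transliterated, 4 characters
print(len(s.encode("utf-8")))      # 12 bytes
print(len(s.encode("utf-16-be")))  # 8 bytes
```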
I wondered about inventing new PostgreSQL-backend-compatible encodings
that swizzle bits around to make CJK/I languages fit, but I kept
coming back to Unicode.
In many systems there is a special way to use UTF-16, which gets you
back to around two bytes per character for Chinese, Japanese, Korean
and probably Indian languages. In MySQL/MariaDB, you can use various
encodings of UTF-16 as a CHARACTER SET with the normal text types, and
in Oracle, DB2, SQL Server/Sybase you can use a separate NVARCHAR type
for UTF-16, alongside the regular text types whose encodings are
controlled by CHARACTER SET or COLLATE. N* types are shorthand for
the standardese "NATIONAL <TYPE>", a string in an unspecified special
encoding, for which they all chose some kind of UTF-16.
At first I thought that with first class extension types as one of our
superpowers, we might be able to do that with an extension, but the
problem there is that text is so tangled up with locales, and I
certainly didn't want to convert all over the place. So I tried to
hack up a minimal demonstration of what a separate UTF-16 text type
might look like as a core data type.
You can save a lot of space if you have a separate "utf16" type
("national text"?), but only for East and South Asian languages:
+ language | octets | delta | string
+----------+---------+-------+-------------------------------------------------
+ English | 45→90 | +100% | In a hole in the ground there lived a hobbit.
+ Spanish | 44→86 | +95% | En un agujero en el suelo, vivía un hobbit.
+ Russian | 59→66 | +12% | В норе под землей жил-был хоббит.
+ Arabic | 57→64 | +12% | كان يعيش هوبيت في حفرة في الأرض.
+ Hebrew | 43→48 | +12% | בתוך חור באדמה חי הוביט.
+ Greek | 74→82 | +11% | Σε μια τρύπα στο έδαφος ζούσε ένα χόμπιτ.
+ Korean | 55→46 | -16% | 땅속 어느 구멍에 한 호빗이 살고 있었다.
+ Hindi | 111→86 | -23% | जमीन में बने एक गड्ढे में एक हॉबिट रहता था।
+ Tamil | 146→108 | -26% | அந்த நிலத்தில் ஒரு துளையில் ஒரு ஹாபிட் வசித்து வந்தது.
+ Chinese | 51→34 | -33% | 在地下一个洞里,住着一个霍比特人。
+ Japanese | 66→44 | -33% | 穴のなかに、ひとりのホビットが暮らしていた。
It's actually -33% in Korean too without spaces, or if you use
double-width spaces, a more common stylistic choice IIUC (this Korean
string happens to use ASCII spaces). The same sort of thing explains
why Russian, Arabic, Hebrew and Greek lose ~12% instead of breaking
even: they share space and some punctuation with ASCII.
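The space effect is easy to reproduce (a sketch, not patch code): an ASCII space costs 1 byte in UTF-8 but 2 in UTF-16, while a Hangul syllable costs 3 versus 2, so the overall ratio depends on how many ASCII characters the string mixes in:

```python
# Each ASCII space costs 1 byte in UTF-8 but 2 in UTF-16; each Hangul
# syllable costs 3 versus 2, so ASCII spaces dilute the UTF-16 win.
hangul = "호빗"                     # 2 Hangul syllables
with_ascii_space = "호빗 호빗"      # ASCII space in between
with_wide_space = "호빗\u3000호빗"  # ideographic space U+3000
for s in (hangul, with_ascii_space, with_wide_space):
    u8, u16 = len(s.encode("utf-8")), len(s.encode("utf-16-be"))
    print(u8, u16, f"{u16 / u8 - 1:+.0%}")
# → 6 4 -33%
# → 13 10 -23%
# → 15 10 -33%
```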
The attached is highly exploratory concept code to try the idea out
and see if experts in CJK computing, defenders of the type system and
others think it might be worth exploring further. Some of the
technical challenges and observations I spotted along the way:
* "text" (etc) and "utf16" need to be comparable incrementally without
conversion
* N^2 explosions in cross-type support function definitions must
surely be avoided
* the solution to that is surely generic programming, but PostgreSQL
is written in ye olde C
* therefore, this POC patch is big on macros as poor-man's C++
* to support fast paths like memcmp()-based ucs_basic comparison you
need big endian UTF-16
* to stuff UTF-16 into varlena you need to tolerate unaligned access
* by a happy coincidence, ICU supports big endian, unaligned UTF-16
* Windows libc locales might in theory allow UTF-16, but that'd have
to be native endian, which I didn't implement
* other systems actually allow both endiannesses as encodings or
subtypes, at the user's option
* they also allow you to control whether surrogates are allowed
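The big-endian requirement in the list above can be sketched like this (assuming BMP-only text; supplementary characters sort in code-unit rather than code-point order, which is a separate wrinkle): byte-wise comparison of big-endian UTF-16 agrees with code-point order, while little-endian does not, because the significant byte comes second.

```python
# memcmp()-style byte comparison only matches code-point order for
# big-endian UTF-16 (for BMP characters); little-endian scrambles it.
points = ["A", "Z", "é", "中", "한"]
by_codepoint = sorted(points, key=ord)
by_be_bytes = sorted(points, key=lambda s: s.encode("utf-16-be"))
by_le_bytes = sorted(points, key=lambda s: s.encode("utf-16-le"))
assert by_be_bytes == by_codepoint   # big endian: byte order == code-point order
assert by_le_bytes != by_codepoint   # little endian: low byte dominates
```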
In this patch you can see some string iterator concepts that I have
been hacking on for an entirely different purpose, namely trying to
figure out how to make our multibyte string support go faster (and
also be safer) by hoisting all the character-at-a-time loops out to
specialisations in string handling functions. That's not shown here.
Neither is any kind of silent transcoding that plagues other systems
that do this kind of thing:
* the "utf16" type is only allowed to contain text that is ASCII,
LATIN1 or Unicode, depending on the database encoding
* for UTF8, it's all of Unicode, for which mb_iterator can trivially
produce UTF-16 or UTF-32
* for LATIN1, that becomes trivial casting since LATIN1 is a strict
subset of Unicode
* for anything else, utf16 only allows ASCII characters, as a degraded
mode just to allow the tests to pass
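For reference, the "trivial" production of UTF-16 from a decoded code point is just the standard surrogate-pair arithmetic. A minimal sketch (not the patch's mb_iterator code, and `to_utf16_units` is a made-up name):

```python
# Standard Unicode algorithm: one code unit for the BMP, a surrogate
# pair for code points above U+FFFF.
def to_utf16_units(cp: int) -> list[int]:
    if cp <= 0xFFFF:
        return [cp]                   # BMP: a single code unit
    cp -= 0x10000                     # 20 bits remain
    high = 0xD800 + (cp >> 10)        # high surrogate: top 10 bits
    low = 0xDC00 + (cp & 0x3FF)       # low surrogate: bottom 10 bits
    return [high, low]

assert to_utf16_units(ord("中")) == [0x4E2D]
assert to_utf16_units(0x1F600) == [0xD83D, 0xDE00]  # 😀 needs a pair
```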
In other words, so far utf16 is not allowed to represent anything that
"text" could not represent and convert trivially. That was originally
a decision to make a quick proof-of-concept plausible, but maybe it's
even a good idea...
I have no doubt that there are lots of complicated problems that I
haven't met yet, when you add more types. What I was trying to
explore here was whether you can exclude most of them by providing
enough conversion-free (incremental) cross-type support. This is
something SQL Server DBAs talk about: mixtures of NVARCHAR and VARCHAR
columns and indexes befuddle the planner and introduce hidden execution
costs if you're not careful. It struck me that with the above
restriction you could perhaps keep all strings cheaply and
incrementally comparable. There are still some things you can't do:
* if you're comparing "text" with "utf16" then you lose the
length-based not-equal fast path
* that's a big deal if it means detoasting
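The lost fast path in the first bullet is easy to see with a toy example (a sketch, not patch code): the same logical string has different byte lengths in the two encodings, so a length mismatch no longer implies inequality and you may have to walk (and possibly detoast) both values.

```python
# Same logical string, different stored byte lengths, so a cross-type
# comparison can't use "lengths differ => not equal" as a shortcut.
s = "hobbit 호빗"
as_utf8 = s.encode("utf-8")        # 7 ASCII bytes + 2 syllables * 3 bytes
as_utf16 = s.encode("utf-16-be")   # 9 characters * 2 bytes
print(len(as_utf8), len(as_utf16))  # → 13 18
```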
Sharing for discussion only. It has enough working to support
converting individual columns from text to utf16 and using btree
indexes. Many
more text functions would need to be converted to generic form, and
many more support functions would be needed for full functionality.
(Individual Indian languages could in theory be compacted even further
to single-byte ISCII. ISCII is in general infeasible as a server
encoding because it has stateful shifts between many scripts, but
single-script variants of ISCII as supported on some Unixen would in
theory be plausible. Since India has so many languages and scripts
and information systems often need to support all of them, I am
reliably informed that UTF-8 reigns supreme there despite taking 3
bytes to represent the tiny 6 bit (?) character set of any individual
language like Devanagari (Hindi etc). GNU/Linux doesn't even support
ISCII, so that idea is basically DOA.)
[1] https://www.postgresql.org/docs/current/infoschema-character-sets.html
[2] https://www.postgresql.org/message-id/flat/ME2PR01MB2532E72B514DC46ED0E10F798A0C0%40ME2PR01MB2532.ausprd01.prod.outlook.com#7d490f97a3df6dfef61e485161e72e06
| Attachment | Content-Type | Size |
|---|---|---|
| 0001-pg_stack_alloc-Provide-API-for-stack-allocation.patch | text/x-patch | 15.6 KB |
| 0002-Provide-utf16-type.patch | text/x-patch | 94.3 KB |