Re: Pre-proposal: unicode normalized text

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Nico Williams <nico(at)cryptonector(dot)com>, Isaac Morland <isaac(dot)morland(at)gmail(dot)com>, Chapman Flack <chap(at)anastigmatix(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Pre-proposal: unicode normalized text
Date: 2023-10-06 22:30:00
Message-ID: CAEze2WipFK6Xrg6Kz0ndt6MSk3GF3LnarNHMJrm=A7dBmYWjnA@mail.gmail.com
Lists: pgsql-hackers

On Fri, 6 Oct 2023, 21:08 Jeff Davis, <pgsql(at)j-davis(dot)com> wrote:

> On Fri, 2023-10-06 at 13:33 -0400, Robert Haas wrote:
> > What I think people really want is a whole column in
> > some encoding that isn't the normal one for that database.
>
> Do people really want that? I'd be curious to know why.
>

One reason someone would want this: a database cluster may have been
initialized with something like --no-locale (thus defaulting to
LC_COLLATE=C, which is desired behaviour and gets fast strcmp operations
for indexing, and LC_CTYPE=C with a SQL_ASCII server encoding, which is not
exactly expected but can be sufficient for some workloads), but now that
the data has grown they want to use UTF-8 collations such as en_US.UTF-8
in some of their new tables' fields.
Or, a user wants to maintain literal translation tables, where different
encodings may be needed for different languages to cover scripts whose
character sets Unicode does not yet fully cover.
Additionally, I'd imagine specialized encodings like Shift_JIS could be
more space-efficient than UTF-8 for e.g. Japanese text, which might be
useful for someone who wants to be a bit more frugal with storage and knows
the text is guaranteed to be in some encoding's native language:
compression can achieve similar savings, but also adds significant
overhead.
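The size difference is easy to verify (a quick sketch; it assumes iconv is available and that the script itself is stored as UTF-8):

```shell
# Compare the byte length of the same Japanese text in UTF-8 vs Shift_JIS.
TEXT='日本語のテキスト'    # 8 characters of Japanese text

# As UTF-8: 3 bytes per character here, 24 bytes total.
printf '%s' "$TEXT" | wc -c

# As Shift_JIS: 2 bytes per character, 16 bytes total (a ~33% saving).
printf '%s' "$TEXT" | iconv -f UTF-8 -t SHIFT_JIS | wc -c
```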

I've certainly experienced situations where I forgot to explicitly include
the encoding alongside initdb --no-locale and only much later noticed that
my large data load was useless due to an inability to create UTF-8-collated
indexes.
I often use --no-locale to make string indexing fast (locale-aware
collation is rarely important to my workload) and to keep environment
variables from being carried over into the installation. An ability to set
or update the encoding of columns would help reduce the pain: I would no
longer have to re-initialize the database or cluster from scratch.
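For reference, the pitfall above can be avoided at initdb time by pinning the encoding explicitly (a sketch of the commands involved; the data directory path is a placeholder):

```shell
# What I typically ran: the C locale is implied, but so is the
# SQL_ASCII server encoding, which later blocks UTF-8 collations.
initdb --no-locale -D /path/to/data

# What avoids the problem: still the C locale for fast strcmp-based
# indexing, but an explicit UTF-8 server encoding so UTF-8 collations
# remain usable on the databases created in this cluster.
initdb --no-locale --encoding=UTF8 -D /path/to/data
```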

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
