Re: Unicode grapheme clusters

From: Greg Stark <stark(at)mit(dot)edu>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unicode grapheme clusters
Date: 2023-01-20 00:37:48
Message-ID: CAM-w4HMTeJ9nwd_9Ohvaka8qNQ8s0Xw=-URaCP5MCe2buDwHcw@mail.gmail.com
Lists: pgsql-hackers

This is how we've always documented it. Postgres treats code points as
"characters" not graphemes.

You don't need to go to anything as esoteric as emojis to see this either.
Accented characters like é have canonical forms that are multiple code
points, and in some character sets some accented characters can only be
represented that way.
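
To make the distinction concrete, here is a minimal sketch in Python rather
than SQL. It relies on the fact that Python's len() counts code points, and
it assumes that is a fair stand-in for how a code-point-based length
function behaves on UTF-8 text. The variable names are only illustrative.

```python
import unicodedata

# "é" as a single precomposed code point (the NFC form).
composed = "\u00e9"

# The canonically equivalent decomposed form (NFD): "e" followed by
# U+0301 COMBINING ACUTE ACCENT.
decomposed = unicodedata.normalize("NFD", composed)

# Both strings render as one user-perceived character, but a function
# that counts code points sees different lengths.
composed_len = len(composed)
decomposed_len = len(decomposed)

print(composed_len, decomposed_len)
```

The two strings are canonically equivalent and display identically, yet they
differ in length and in what a code-point-based substring would return.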

But I don't think there's any reason to consider changing the existing
functions. They have to be consistent with substr and the other string
manipulation functions.

We could add new functions to work with graphemes but it might bring more
pain keeping it up to date....
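
As a rough illustration of what a grapheme-aware function has to do, the
sketch below (Python, hypothetical helper name approx_graphemes) attaches
each combining mark to the preceding base character. This is a deliberate
simplification, not the full Unicode text segmentation rules: it does not
handle emoji sequences, Hangul jamo, or regional indicators, and the real
rules are revised along with the Unicode standard, which is the maintenance
burden in question.

```python
import unicodedata

def approx_graphemes(s: str) -> list[str]:
    """Group each base character with the combining marks that follow it.

    Simplified approximation only; NOT the full Unicode grapheme cluster
    algorithm.
    """
    clusters: list[str] = []
    for ch in s:
        # A nonzero canonical combining class marks a combining character.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# "e" + combining acute accent, followed by a plain "a":
sample = "e\u0301a"
print(len(sample))                    # number of code points
print(len(approx_graphemes(sample)))  # number of approximate graphemes
```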
