Re: Unicode grapheme clusters

From: Greg Stark <stark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unicode grapheme clusters
Date: 2023-01-24 16:40:01
Message-ID: CAM-w4HNoonCZW3p=D9J2ev7LpOKXiAsgaH-XOUV=3gL_OJMwOA@mail.gmail.com
Lists: pgsql-hackers

On Sat, 21 Jan 2023 at 13:17, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> Probably our long-term answer is to avoid depending on wcwidth
> and use wcswidth instead. But it's hard to get excited about
> doing the legwork for that until popular libc implementations
> get it right.

Here's an interesting blog post about trying to do this in Rust:

https://tomdebruijn.com/posts/rust-string-length-width-calculations/

TL;DR: even counting the number of graphemes isn't enough, because
terminals typically (but not always) display emoji graphemes using two
columns.
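
To make the mismatch concrete, here is a rough Python sketch (mine, not
taken from the blog post) that sums a per-code-point width, which is
roughly what calling wcwidth on each character amounts to. The helper
name naive_width and the rule "categories Mn, Me and Cf count as zero
columns" are assumptions made for illustration; a real libc wcwidth
uses its own tables and may disagree.

```python
import unicodedata

# Assumption: treat combining marks and format characters (such as
# ZERO WIDTH JOINER) as occupying no columns.
ZERO_WIDTH_CATEGORIES = {"Mn", "Me", "Cf"}


def naive_width(text: str) -> int:
    """Sum an estimated column width for each code point independently."""
    width = 0
    for ch in text:
        if unicodedata.category(ch) in ZERO_WIDTH_CATEGORIES:
            continue
        # East Asian Wide/Fullwidth characters (including most emoji)
        # are counted as two columns, everything else as one.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


# MAN + ZWJ + WOMAN + ZWJ + GIRL: a single grapheme cluster.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
# "e" followed by COMBINING ACUTE ACCENT: also a single grapheme cluster.
accented = "e\u0301"

for label, s in (("family", family), ("accented e", accented)):
    print(label, "code points:", len(s), "naive width:", naive_width(s))
```

With reasonably recent Unicode data, the family sequence is five code
points and sums to six columns here, while a terminal that renders it
as one emoji would typically use two. The accented "e" works out only
because the combining mark happens to be zero-width.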

At the end of the day Unicode more or less assumes a variable-width
display where the rendering is handled by something that has access to
the actual font metrics. So anything that tries to line text up in
columns, and expects that to hold for whatever rendering system and
font end up displaying it, is only ever making a best guess.

--
greg
