Re: badly calculated width of emoji in psql

From: Jacob Champion <pchampion(at)vmware(dot)com>
To: "pavel(dot)stehule(at)gmail(dot)com" <pavel(dot)stehule(at)gmail(dot)com>, "horikyota(dot)ntt(at)gmail(dot)com" <horikyota(dot)ntt(at)gmail(dot)com>
Cc: "laurenz(dot)albe(at)cybertec(dot)at" <laurenz(dot)albe(at)cybertec(dot)at>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: badly calculated width of emoji in psql
Date: 2021-07-07 18:03:34
Message-ID: 9e3c847108bac6041f50e86d08d8835f5cf7cd78.camel@vmware.com
Lists: pgsql-hackers

On Mon, 2021-04-05 at 14:07 +0900, Kyotaro Horiguchi wrote:
> At Fri, 2 Apr 2021 11:51:26 +0200, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> wrote in
> > with this patch, the formatting is correct
>
> I think the hardest point of this issue is that we don't have a
> reasonable authoritative source that determines character width. And
> that the presentation is heavily dependent on environment.
>
> Unicode 9 and/or 10 defines the character properties "Emoji" and
> "Emoji_Presentation", and tr51[1] says that
>
> > Emoji are generally presented with a square aspect ratio, which
> > presents a problem for flags.
> ...
> > Current practice is for emoji to have a square aspect ratio, deriving
> > from their origin in Japanese. For interoperability, it is recommended
> > that this practice be continued with current and future emoji. They
> > will typically have about the same vertical placement and advance
> > width as CJK ideographs. For example:
>
> Ok, even putting aside flags, the first table in [2] asserts that "#",
> "*", "0-9" are emoji characters. But we, and I think everyone else,
> never present them in two columns. And the table has many mysterious
> holes I haven't looked into.

I think that's why Emoji_Presentation is false for those characters --
they _could_ be presented as emoji if the UI should choose to do so, or
if an emoji presentation selector is used, but by default a text
presentation would be expected.

> We could use Emoji_Presentation=yes for the purpose, but for example,
> U+23E9(BLACK RIGHT-POINTING DOUBLE TRIANGLE) has the property
> Emoji_Presentation=yes but U+23ED(BLACK RIGHT-POINTING DOUBLE TRIANGLE
> WITH VERTICAL BAR) does not, for a reason uncertain to me. It doesn't
> look like anything other than some kind of mistake.

That is strange.

> About environment, for example, U+23E9 is an emoji, and
> Emoji_Presentation=yes, but it is shown in one column on my
> xterm. (I'm not sure what font I am using.)

I would guess that's the key issue here. If we choose a particular
width for emoji characters, is there anything keeping a terminal's font
from doing something different anyway?

Furthermore, if the stream contains an emoji presentation selector
after a code point that would normally be text, shouldn't we change
that glyph to have an emoji "expected width"?
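
To make the selector question concrete, here is a minimal sketch in C of
how a width calculation might honor presentation selectors. Everything
in it is an assumption for illustration: base_width() and
sequence_width() are hypothetical names, not functions from psql or the
patch, and base_width() is a stub standing in for a real per-code-point
lookup. It also applies a selector to whatever code point precedes it,
whereas a real implementation would presumably restrict that to
characters with Emoji=yes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VS15_TEXT  0xFE0E	/* VARIATION SELECTOR-15: text presentation */
#define VS16_EMOJI 0xFE0F	/* VARIATION SELECTOR-16: emoji presentation */

/*
 * Hypothetical stub for a default width lookup. It only knows one
 * Emoji_Presentation=yes range; everything else counts as one column.
 */
static int
base_width(uint32_t cp)
{
	if (cp >= 0x23E9 && cp <= 0x23EC)
		return 2;
	return 1;
}

/*
 * Sum the expected column width of a sequence of code points, letting a
 * trailing selector override the default width of the preceding code
 * point. Simplification: any code point followed by VS16 is treated as
 * two columns, without checking that it is an emoji at all.
 */
static int
sequence_width(const uint32_t *cps, size_t n)
{
	int			total = 0;

	assert(cps != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
	{
		uint32_t	cp = cps[i];
		int			w;

		/* The selectors themselves occupy no column. */
		if (cp == VS15_TEXT || cp == VS16_EMOJI)
			continue;

		w = base_width(cp);
		if (i + 1 < n)
		{
			if (cps[i + 1] == VS16_EMOJI)
				w = 2;			/* emoji presentation requested */
			else if (cps[i + 1] == VS15_TEXT)
				w = 1;			/* text presentation requested */
		}
		total += w;
	}
	return total;
}
```

With this sketch, U+2764 (Emoji=yes, Emoji_Presentation=no) counts as
one column on its own and two when followed by U+FE0F. Whether the
terminal actually renders it that way is of course a separate matter.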

I'm wondering if the most correct solution would be to have the user
tell the client what width to use, using .psqlrc or something.

> A possible compromise is that we treat all Emoji=yes characters
> excluding ASCII characters as double-width and manually merge the
> fragmented regions into reasonably larger chunks.

We could also keep the fragments as-is and generate a full interval
table, like common/unicode_combining_table.h. It looks like there are
roughly twice as many emoji intervals as combining intervals, so
hopefully adding a second binary search wouldn't be noticeably slower.
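
For reference, a minimal sketch in C of what such a second lookup could
look like. The names (cp_interval, emoji_sample, cp_in_table) are
hypothetical and not taken from the PostgreSQL source, and the
three-entry table is a hand-picked excerpt of Emoji_Presentation=yes
ranges; a real table would be generated from the Unicode data files the
same way the combining table is.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One closed range of code points, [first, last]. */
typedef struct
{
	uint32_t	first;
	uint32_t	last;
} cp_interval;

/*
 * Hypothetical excerpt only, sorted by code point and non-overlapping.
 * A generated table would keep every fragment as its own interval.
 */
static const cp_interval emoji_sample[] = {
	{0x231A, 0x231B},			/* watch, hourglass */
	{0x23E9, 0x23EC},			/* double triangles; U+23ED is not included */
	{0x1F600, 0x1F64F},			/* emoticons */
};

#define EMOJI_SAMPLE_LEN (sizeof(emoji_sample) / sizeof(emoji_sample[0]))

/*
 * Binary search over a sorted interval table. Returns 1 if cp falls
 * inside any interval, 0 otherwise. Cost is O(log n) comparisons, so
 * doubling the table size adds about one more iteration.
 */
static int
cp_in_table(uint32_t cp, const cp_interval *table, size_t n)
{
	size_t		lo = 0;
	size_t		hi = n;			/* search the index range [lo, hi) */

	assert(table != NULL || n == 0);

	while (lo < hi)
	{
		size_t		mid = lo + (hi - lo) / 2;

		if (cp < table[mid].first)
			hi = mid;
		else if (cp > table[mid].last)
			lo = mid + 1;
		else
			return 1;
	}
	return 0;
}
```

A caller would check the combining table first and, if that misses,
call something like cp_in_table(cp, emoji_sample, EMOJI_SAMPLE_LEN) to
decide between one and two columns.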

--

In your opinion, would the current one-line patch proposal make things
strictly better than they are today, or would it have mixed results?
I'm wondering how to help this patch move forward for the current
commitfest, or if we should maybe return with feedback for now.

--Jacob
