Re: badly calculated width of emoji in psql

From: Jacob Champion <pchampion(at)vmware(dot)com>
To: "pavel(dot)stehule(at)gmail(dot)com" <pavel(dot)stehule(at)gmail(dot)com>, "laurenz(dot)albe(at)cybertec(dot)at" <laurenz(dot)albe(at)cybertec(dot)at>, "michael(at)paquier(dot)xyz" <michael(at)paquier(dot)xyz>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "horikyota(dot)ntt(at)gmail(dot)com" <horikyota(dot)ntt(at)gmail(dot)com>
Subject: Re: badly calculated width of emoji in psql
Date: 2021-07-21 22:12:51
Message-ID: 2f7ca10c7847f69267685beca7b5aa92dd876259.camel@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, 2021-07-21 at 00:08 +0000, Jacob Champion wrote:
> I note that the doc comment for ucs_wcwidth()...
>
> > * - Spacing characters in the East Asian Wide (W) or East Asian
> > * FullWidth (F) category as defined in Unicode Technical
> > * Report #11 have a column width of 2.
>
> ...doesn't match reality anymore. The East Asian width handling was
> last updated in 2006, it looks like? So I wonder whether fixing the
> code to match the comment would not only fix the emoji problem but also
> a bunch of other non-emoji characters.

Attached is my attempt at that. This adds a second interval table,
handling not only the emoji range in the original patch but also
correcting several non-emoji character ranges which are included in the
13.0 East Asian Wide/Fullwidth sets. Try for example

- U+2329 LEFT POINTING ANGLE BRACKET
- U+16FE0 TANGUT ITERATION MARK
- U+18000 KATAKANA LETTER ARCHAIC E

This should work reasonably well for terminals that depend on modern
versions of Unicode's EastAsianWidth.txt to figure out character width.
I don't know how it behaves on BSD libc or Windows.

The new binary search isn't free, but my naive attempt at measuring the
performance hit made it look worse than it actually is. Since the
measurement function was previously returning an incorrect (too short)
width, we used to get a free performance boost by not printing the
correct number of alignment/border characters. I'm still trying to
figure out how best to isolate the performance changes due to this
patch.

Pavel, I'd be interested to see what your benchmarks find with this
code. Does this fix the original issue for you?

--Jacob

Attachment Content-Type Size
ucs_wcwidth-update-Fullwidth-and-Wide-codepoint-set.patch text/x-patch 9.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2021-07-21 22:16:42 Re: Rename of triggers for partitioned tables
Previous Message Tom Lane 2021-07-21 21:33:19 Re: WIP: Relaxing the constraints on numeric scale