Re: badly calculated width of emoji in psql

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Jacob Champion <pchampion(at)vmware(dot)com>
Cc: "laurenz(dot)albe(at)cybertec(dot)at" <laurenz(dot)albe(at)cybertec(dot)at>, "michael(at)paquier(dot)xyz" <michael(at)paquier(dot)xyz>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "horikyota(dot)ntt(at)gmail(dot)com" <horikyota(dot)ntt(at)gmail(dot)com>
Subject: Re: badly calculated width of emoji in psql
Date: 2021-08-12 06:41:48
Message-ID: CAFj8pRCuaHgxa8M_FxmhGDg46qWp4yF-Xz_Xkj5NZT-F9fkQ1w@mail.gmail.com
Lists: pgsql-hackers

On Thu, 22 Jul 2021 at 0:12, Jacob Champion <pchampion(at)vmware(dot)com>
wrote:

> On Wed, 2021-07-21 at 00:08 +0000, Jacob Champion wrote:
> > I note that the doc comment for ucs_wcwidth()...
> >
> > > * - Spacing characters in the East Asian Wide (W) or East Asian
> > > * FullWidth (F) category as defined in Unicode Technical
> > > * Report #11 have a column width of 2.
> >
> > ...doesn't match reality anymore. The East Asian width handling was
> > last updated in 2006, it looks like? So I wonder whether fixing the
> > code to match the comment would not only fix the emoji problem but also
> > a bunch of other non-emoji characters.
>
> Attached is my attempt at that. This adds a second interval table,
> handling not only the emoji range in the original patch but also
> correcting several non-emoji character ranges which are included in the
> 13.0 East Asian Wide/Fullwidth sets. Try for example
>
> - U+2329 LEFT-POINTING ANGLE BRACKET
> - U+16FE0 TANGUT ITERATION MARK
> - U+1B000 KATAKANA LETTER ARCHAIC E
>
> This should work reasonably well for terminals that depend on modern
> versions of Unicode's EastAsianWidth.txt to figure out character width.
> I don't know how it behaves on BSD libc or Windows.
>
> The new binary search isn't free, but my naive attempt at measuring the
> performance hit made it look worse than it actually is. Since the
> measurement function was previously returning an incorrect (too short)
> width, we used to get a free performance boost by not printing the
> correct number of alignment/border characters. I'm still trying to
> figure out how best to isolate the performance changes due to this
> patch.
>
> Pavel, I'd be interested to see what your benchmarks find with this
> code. Does this fix the original issue for you?
>

This patch fixed badly formatted tables with emoji.

I checked this patch, and it is correct and a step forward: it sets the
intervals of double-width characters dynamically, and the code is more
readable.
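
To illustrate the interval-table idea discussed above, here is a minimal
sketch of a binary search over a sorted table of wide ranges. It is not the
code from the patch: the names wide_sample, in_table and sample_width are
hypothetical, and the table holds only a few ranges that I believe are East
Asian Wide in Unicode 13.0, not the full set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive range of code points. */
struct interval
{
    uint32_t first;
    uint32_t last;
};

/*
 * Illustrative excerpt only (hypothetical, not the patch's table).
 * Entries must stay sorted and non-overlapping for the binary search.
 */
static const struct interval wide_sample[] = {
    {0x2329, 0x232A},   /* left/right-pointing angle brackets */
    {0x16FE0, 0x16FE1}, /* Tangut iteration mark and neighbor */
    {0x1B000, 0x1B0FF}, /* Kana Supplement */
    {0x1F600, 0x1F64F}, /* emoticons */
};

/* Return 1 if ucs falls inside any interval of the sorted table, else 0. */
static int
in_table(uint32_t ucs, const struct interval *table, size_t n)
{
    size_t lo = 0;
    size_t hi = n;          /* search the half-open index range [lo, hi) */

    assert(table != NULL);

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (ucs > table[mid].last)
            lo = mid + 1;
        else if (ucs < table[mid].first)
            hi = mid;
        else
            return 1;
    }
    return 0;
}

/*
 * Simplified width lookup: 2 columns for a code point in the sample table,
 * 1 otherwise. A real implementation also handles control and combining
 * characters, which this sketch ignores.
 */
static int
sample_width(uint32_t ucs)
{
    size_t n = sizeof(wide_sample) / sizeof(wide_sample[0]);

    return in_table(ucs, wide_sample, n) ? 2 : 1;
}
```

Each lookup costs O(log n) comparisons in the number of intervals, so a
longer table adds only a few extra steps per character.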

I also checked performance, and although there is a measurable slowdown, it
is negligible in absolute terms. The previous code was slightly faster because
it checked fewer ranges, but it was not fully correct or up to date.

The patch applied without problems.
There are no regression tests, but I am not sure that they are necessary in
this case.
make check-world passed without problems.

I'll mark this patch as ready for committer

Regards

Pavel

>
> --Jacob
>
