From: | Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> |
---|---|
To: | Jacob Champion <pchampion(at)vmware(dot)com> |
Cc: | "laurenz(dot)albe(at)cybertec(dot)at" <laurenz(dot)albe(at)cybertec(dot)at>, "michael(at)paquier(dot)xyz" <michael(at)paquier(dot)xyz>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "horikyota(dot)ntt(at)gmail(dot)com" <horikyota(dot)ntt(at)gmail(dot)com> |
Subject: | Re: badly calculated width of emoji in psql |
Date: | 2021-08-12 06:41:48 |
Message-ID: | CAFj8pRCuaHgxa8M_FxmhGDg46qWp4yF-Xz_Xkj5NZT-F9fkQ1w@mail.gmail.com |
Lists: | pgsql-hackers |
On Thu, Jul 22, 2021 at 0:12, Jacob Champion <pchampion(at)vmware(dot)com>
wrote:
> On Wed, 2021-07-21 at 00:08 +0000, Jacob Champion wrote:
> > I note that the doc comment for ucs_wcwidth()...
> >
> > > * - Spacing characters in the East Asian Wide (W) or East Asian
> > > * FullWidth (F) category as defined in Unicode Technical
> > > * Report #11 have a column width of 2.
> >
> > ...doesn't match reality anymore. The East Asian width handling was
> > last updated in 2006, it looks like? So I wonder whether fixing the
> > code to match the comment would not only fix the emoji problem but also
> > a bunch of other non-emoji characters.
>
> Attached is my attempt at that. This adds a second interval table,
> handling not only the emoji range in the original patch but also
> correcting several non-emoji character ranges which are included in the
> 13.0 East Asian Wide/Fullwidth sets. Try for example
>
> - U+2329 LEFT POINTING ANGLE BRACKET
> - U+16FE0 TANGUT ITERATION MARK
> - U+18000 KATAKANA LETTER ARCHAIC E
>
> This should work reasonably well for terminals that depend on modern
> versions of Unicode's EastAsianWidth.txt to figure out character width.
> I don't know how it behaves on BSD libc or Windows.
>
> The new binary search isn't free, but my naive attempt at measuring the
> performance hit made it look worse than it actually is. Since the
> measurement function was previously returning an incorrect (too short)
> width, we used to get a free performance boost by not printing the
> correct number of alignment/border characters. I'm still trying to
> figure out how best to isolate the performance changes due to this
> patch.
>
> Pavel, I'd be interested to see what your benchmarks find with this
> code. Does this fix the original issue for you?
>
This patch fixed the badly formatted tables with emoji.
I reviewed this patch. It is correct and a step forward: it dynamically
sets the intervals of double-wide characters, and the code is more
readable.
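To illustrate the kind of lookup discussed above, here is a minimal sketch of a binary search over a sorted interval table. It is not the patch itself: the names `sketch_interval`, `sketch_wide`, `sketch_in_table` and `sketch_wcwidth` are hypothetical, the table is only a small illustrative subset of the East Asian Wide/Fullwidth ranges, and the handling of combining and control characters is omitted.

```c
#include <stddef.h>
#include <stdint.h>

/* Inclusive range of code points (hypothetical type, for illustration). */
typedef struct
{
	uint32_t	first;
	uint32_t	last;
} sketch_interval;

/*
 * Illustrative subset of wide/fullwidth ranges. It must stay sorted and
 * non-overlapping for the binary search to work. A real table would be
 * derived from EastAsianWidth.txt and be much longer.
 */
static const sketch_interval sketch_wide[] = {
	{0x1100, 0x115F},			/* Hangul Jamo initial consonants */
	{0x2329, 0x232A},			/* angle brackets */
	{0x4E00, 0x9FFF},			/* CJK Unified Ideographs */
	{0xAC00, 0xD7A3},			/* Hangul Syllables */
	{0xFF01, 0xFF60},			/* Fullwidth forms */
	{0x1F600, 0x1F64F}			/* Emoticons */
};

/* Return 1 if ucs falls inside any interval of the sorted table, else 0. */
static int
sketch_in_table(uint32_t ucs, const sketch_interval *table, size_t n)
{
	size_t		lo = 0;
	size_t		hi = n;			/* search the half-open range [lo, hi) */

	while (lo < hi)
	{
		size_t		mid = lo + (hi - lo) / 2;

		if (ucs < table[mid].first)
			hi = mid;
		else if (ucs > table[mid].last)
			lo = mid + 1;
		else
			return 1;
	}
	return 0;
}

/* Simplified width: 2 for code points in the wide table, 1 otherwise. */
static int
sketch_wcwidth(uint32_t ucs)
{
	size_t		n = sizeof(sketch_wide) / sizeof(sketch_wide[0]);

	return sketch_in_table(ucs, sketch_wide, n) ? 2 : 1;
}
```

With this table, `sketch_wcwidth(0x1F600)` returns 2 and `sketch_wcwidth('a')` returns 1. Each lookup costs a number of comparisons logarithmic in the table size, which fits the small but measurable slowdown when more ranges are checked.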
I also checked performance. Although there is a measurable slowdown, it is
negligible in absolute terms. The previous code was a little faster because
it checked fewer ranges, but it was not fully correct or up to date.
The patch applied without problems.
There are no regression tests, but I am not sure that they are necessary for
this case.
make check-world passed without problems.
I'll mark this patch as ready for committer.
Regards
Pavel
>
> --Jacob
>