Re: badly calculated width of emoji in psql

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Jacob Champion <pchampion(at)vmware(dot)com>
Cc: "laurenz(dot)albe(at)cybertec(dot)at" <laurenz(dot)albe(at)cybertec(dot)at>, "michael(at)paquier(dot)xyz" <michael(at)paquier(dot)xyz>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "horikyota(dot)ntt(at)gmail(dot)com" <horikyota(dot)ntt(at)gmail(dot)com>
Subject: Re: badly calculated width of emoji in psql
Date: 2021-07-23 15:42:20
Message-ID: CAFj8pRD5NxUt435zn5dg_yEpDhV6r-20b0+tXH=gzTHUthKR5g@mail.gmail.com
Lists: pgsql-hackers

Hi

On Thu, Jul 22, 2021 at 0:12, Jacob Champion <pchampion(at)vmware(dot)com>
wrote:

> On Wed, 2021-07-21 at 00:08 +0000, Jacob Champion wrote:
> > I note that the doc comment for ucs_wcwidth()...
> >
> > > * - Spacing characters in the East Asian Wide (W) or East Asian
> > > * FullWidth (F) category as defined in Unicode Technical
> > > * Report #11 have a column width of 2.
> >
> > ...doesn't match reality anymore. The East Asian width handling was
> > last updated in 2006, it looks like? So I wonder whether fixing the
> > code to match the comment would not only fix the emoji problem but also
> > a bunch of other non-emoji characters.
>
> Attached is my attempt at that. This adds a second interval table,
> handling not only the emoji range in the original patch but also
> correcting several non-emoji character ranges which are included in the
> 13.0 East Asian Wide/Fullwidth sets. Try for example
>
> - U+2329 LEFT POINTING ANGLE BRACKET
> - U+16FE0 TANGUT ITERATION MARK
> - U+1B000 KATAKANA LETTER ARCHAIC E
>
> This should work reasonably well for terminals that depend on modern
> versions of Unicode's EastAsianWidth.txt to figure out character width.
> I don't know how it behaves on BSD libc or Windows.
>
> The new binary search isn't free, but my naive attempt at measuring the
> performance hit made it look worse than it actually is. Since the
> measurement function was previously returning an incorrect (too short)
> width, we used to get a free performance boost by not printing the
> correct number of alignment/border characters. I'm still trying to
> figure out how best to isolate the performance changes due to this
> patch.
>
> Pavel, I'd be interested to see what your benchmarks find with this
> code. Does this fix the original issue for you?
>
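
For readers following along, the lookup described above (a sorted interval
table searched with a binary search) can be sketched roughly as below. This
is only a minimal illustration, not the code from the attached patch: the
names width_interval, wide_sample, in_interval_table and
sample_display_width are hypothetical, the table holds just a handful of
ranges that are Wide/Fullwidth in EastAsianWidth.txt rather than the full
13.0 set, and zero-width (combining, control) characters are ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type: one inclusive range of code points. */
typedef struct
{
    uint32_t first;
    uint32_t last;
} width_interval;

/*
 * Illustrative subset only, sorted by code point and non-overlapping.
 * A real table would be generated from EastAsianWidth.txt.
 */
static const width_interval wide_sample[] = {
    {0x1100, 0x115F},   /* Hangul Jamo initial consonants */
    {0x2329, 0x232A},   /* left/right-pointing angle brackets */
    {0xFF01, 0xFF60},   /* fullwidth ASCII variants */
    {0x16FE0, 0x16FE3}, /* Tangut iteration mark and neighbours */
    {0x1F600, 0x1F64F}, /* emoticons */
};

/* Binary search: returns 1 if ucs falls inside any interval, else 0. */
static int
in_interval_table(uint32_t ucs, const width_interval *table, size_t n)
{
    size_t lo = 0;
    size_t hi = n;          /* search the half-open range [lo, hi) */

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (ucs > table[mid].last)
            lo = mid + 1;
        else if (ucs < table[mid].first)
            hi = mid;
        else
            return 1;
    }
    return 0;
}

/* Width 2 for code points in the sample table, 1 for everything else. */
static int
sample_display_width(uint32_t ucs)
{
    size_t n = sizeof(wide_sample) / sizeof(wide_sample[0]);

    return in_interval_table(ucs, wide_sample, n) ? 2 : 1;
}
```

With this sketch, sample_display_width(0x2329) and
sample_display_width(0x1F600) report two columns, while plain ASCII falls
through the search and reports one. The cost per character is one extra
O(log n) search, which is where the additional time in the measurements
below would come from.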

I can confirm that the original issue is fixed.

I tested performance with three data sets:

1. Typical data: a mix of ASCII and UTF-8 characters typical for the Czech
language, 25K lines. There is a very small slowdown of 2 ms, from 24 ms to
26 ms (the stored file with this result is 3MB).

2. The worst case: this report contains only emoji, 1000 chars * 10K rows.
The slowdown is more significant, from 160 ms to 220 ms (the stored file is
39MB).

3. A somewhat obscure data set generated by \x and select * from pg_proc.
It has 99K lines with a lot of Unicode decorations (borders), and each line
has 17K chars (the stored file is 1.7GB). In this data set I see a slowdown
from 4300 ms to 4700 ms.

In all cases the data are in memory (in the filesystem cache). I tested
loading into pspg.

9% looks too high, but in absolute time it is 400 ms for 99K lines of very
atypical data, or 2 ms for more typical results. 2 ms is nothing (for
interactive work). Moreover, this is from pspg's perspective. In psql there
can be overhead from the network, protocol processing, formatting, and
more, and psql doesn't need to calculate the display width of decorations
(borders), which is the reason for the slowdown in pspg.
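
To make the relative numbers explicit, here is a small sketch of the
arithmetic using only the timings reported above. The helper name
slowdown_percent is made up for illustration.

```c
/* Relative slowdown in percent, given timings before and after a change. */
static double
slowdown_percent(double before_ms, double after_ms)
{
    return (after_ms - before_ms) / before_ms * 100.0;
}

/*
 * Applied to the three data sets above:
 *   slowdown_percent(24.0, 26.0)      typical Czech data, about 8.3%
 *   slowdown_percent(160.0, 220.0)    emoji-only worst case, 37.5%
 *   slowdown_percent(4300.0, 4700.0)  \x pg_proc output, about 9.3%
 */
```

So the roughly 9% figure applies to the typical and pg_proc cases, while
the emoji-only report is the outlier in relative terms.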

Pavel

> --Jacob
>
