Re: badly calculated width of emoji in psql

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Jacob Champion <pchampion(at)vmware(dot)com>
Cc: "michael(at)paquier(dot)xyz" <michael(at)paquier(dot)xyz>, "peter(dot)eisentraut(at)enterprisedb(dot)com" <peter(dot)eisentraut(at)enterprisedb(dot)com>, "pavel(dot)stehule(at)gmail(dot)com" <pavel(dot)stehule(at)gmail(dot)com>, "laurenz(dot)albe(at)cybertec(dot)at" <laurenz(dot)albe(at)cybertec(dot)at>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "horikyota(dot)ntt(at)gmail(dot)com" <horikyota(dot)ntt(at)gmail(dot)com>
Subject: Re: badly calculated width of emoji in psql
Date: 2021-08-20 17:05:49
Message-ID: CAFBsxsGOCpzV7c-f3a8ADsA1n4uZ=8puCctQp+x7W0vgkv=w+g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Aug 19, 2021 at 8:05 PM Jacob Champion <pchampion(at)vmware(dot)com> wrote:
>
> On Thu, 2021-08-19 at 13:49 -0400, John Naylor wrote:
> > I had a couple further thoughts:
> >
> > 1. The coding historically used normal comparison and branching for
> > everything, but recently it only does that to detect control
> > characters, and then goes through a binary search (and with this
> > patch, two binary searches) for everything else. Although the
> > performance regression of the current patch seems negligible,
>
> If I'm reading the code correctly, ASCII characters don't go through
> the binary searches; they're already short-circuited at the beginning
> of mbbisearch(). On my machine that's enough for the patch to be a
> performance _improvement_ for ASCII, not a regression.

I had assumed that there would be a function call, but looking at the
assembly, it's inlined, so you're right.

> Should be && instead of ||, I think.

Yes, you're quite right. Clearly I didn't test it. :-) But given the
previous, I won't pursue this further.

> > 2. As written, the patch adds a script that's very close to an
> > existing one, and emits a new file that has the same type of contents
> > as an existing one, both of which are #included in one place. I
> > wonder if we should consider having just one script that ingests both
> > files and emits one file. All we need is for mbinterval to encode the
> > character width, but we can probably do that with a bitfield like the
> > normprops table to save space. Then, we only do one binary search.
> > Opinions?
>
> I guess it just depends on what the end result looks/performs like.
> We'd save seven hops or so in the worst case?

Something like that. Attached is what I had in mind (using real patches to
see what the CF bot thinks):

0001 is a simple renaming
0002 puts the char width inside the mbinterval so we can put arbitrary
values there
0003 is Jacob's patch adjusted to use the same binary search as for
combining characters
0004 removes the ancient limit on combining characters, so the following
works now:

SELECT U&'\+0102E1\+0102E0';
+----------+
| ?column? |
+----------+
| 𐋡𐋠 |
+----------+
(1 row)

I think the adjustments to 0003 result in a cleaner and more extensible
design, but a case could be made otherwise. The former combining table
script is a bit more complex than the sum of its former self and Jacob's
proposed new script, but just slightly.

Also, I checked the behavior of this comment that I proposed to remove
upthread:

- * - Other format characters (general category code Cf in the Unicode
- * database) and ZERO WIDTH SPACE (U+200B) have a column width of 0.

We don't handle the latter in our current setup:

SELECT U&'foo\200Bbar';
+----------+
| ?column? |
+----------+
| foobar |
+----------+
(1 row)

Not sure if we should do anything about this. It was an explicit exception
years ago in our vendored manual table, but is not labeled as such in the
official Unicode files.

--
John Naylor
EDB: http://www.enterprisedb.com

Attachment Content-Type Size
v2-0004-Extend-collection-of-Unicode-combining-characters.patch application/octet-stream 4.3 KB
v2-0001-Change-references-to-unicode_combining_table-to-u.patch application/octet-stream 2.7 KB
v2-0003-Fix-display-width-of-emoji-and-other-codepoints.patch application/octet-stream 11.6 KB
v2-0002-Allow-callers-mbbisearch-to-get-an-explicit-chara.patch application/octet-stream 11.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2021-08-20 17:15:21 Re: The Free Space Map: Problems and Opportunities
Previous Message Robert Haas 2021-08-20 17:04:58 Re: archive status ".ready" files may be created too early