Re: badly calculated width of emoji in psql

From: Jacob Champion <pchampion(at)vmware(dot)com>
To: "john(dot)naylor(at)enterprisedb(dot)com" <john(dot)naylor(at)enterprisedb(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "pavel(dot)stehule(at)gmail(dot)com" <pavel(dot)stehule(at)gmail(dot)com>, "laurenz(dot)albe(at)cybertec(dot)at" <laurenz(dot)albe(at)cybertec(dot)at>, "peter(dot)eisentraut(at)enterprisedb(dot)com" <peter(dot)eisentraut(at)enterprisedb(dot)com>, "michael(at)paquier(dot)xyz" <michael(at)paquier(dot)xyz>, "horikyota(dot)ntt(at)gmail(dot)com" <horikyota(dot)ntt(at)gmail(dot)com>
Subject: Re: badly calculated width of emoji in psql
Date: 2021-08-24 17:50:50
Message-ID: b51c5abaf17a46bf82027cd833cfc91ab6d45aad.camel@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, 2021-08-20 at 13:05 -0400, John Naylor wrote:
> On Thu, Aug 19, 2021 at 8:05 PM Jacob Champion <pchampion(at)vmware(dot)com> wrote:
> > I guess it just depends on what the end result looks/performs like.
> > We'd save seven hops or so in the worst case?
>
> Something like that. Attached is what I had in mind (using real
> patches to see what the CF bot thinks):
>
> 0001 is a simple renaming
> 0002 puts the char width inside the mbinterval so we can put arbitrary values there

0002 introduces a mixed declarations/statements warning for
ucs_wcwidth(). Other than that, LGTM overall.

> --- a/src/common/wchar.c
> +++ b/src/common/wchar.c
> @@ -583,9 +583,9 @@ pg_utf_mblen(const unsigned char *s)
>
> struct mbinterval
> {
> - unsigned short first;
> - unsigned short last;
> - signed short width;
> + unsigned int first;
> + unsigned int last:21;
> + signed int width:4;
> };

Oh, right -- my patch moved mbinterval from short to int, but should I
have used uint32 instead? It would only matter in theory for the
`first` member now that the bitfields are there.

> I think the adjustments to 0003 result in a cleaner and more
> extensible design, but a case could be made otherwise. The former
> combining table script is a bit more complex than the sum of its
> former self and Jacob's proposed new script, but just slightly.

The microbenchmark says it's also more performant, so +1 to your
version.

Does there need to be any sanity check for overlapping ranges between
the combining and fullwidth sets? The Unicode data on a dev's machine
would have to be broken somehow for that to happen, but it could
potentially go undetected for a while if it did.

> Also, I checked the behavior of this comment that I proposed to remove upthread:
>
> - * - Other format characters (general category code Cf in the Unicode
> - * database) and ZERO WIDTH SPACE (U+200B) have a column width of 0.
>
> We don't handle the latter in our current setup:
>
> SELECT U&'foo\200Bbar';
> +----------+
> | ?column? |
> +----------+
> | foobar |
> +----------+
> (1 row)
>
> Not sure if we should do anything about this. It was an explicit
> exception years ago in our vendored manual table, but is not labeled
> as such in the official Unicode files.

I'm wary of changing too many things at once, but it does seem like we
should be giving that codepoint a width of 0.

On Tue, 2021-08-24 at 12:05 -0400, John Naylor wrote:
> I plan to commit my proposed v2 this week unless I hear reservations
> about my adjustments, or bikeshedding. I'm thinking of squashing 0001
> and 0002.

+1

Thanks!
--Jacob

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2021-08-24 18:16:38 Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
Previous Message Stephen Frost 2021-08-24 17:46:28 Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)