Re: badly calculated width of emoji in psql

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Jacob Champion <pchampion(at)vmware(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "pavel(dot)stehule(at)gmail(dot)com" <pavel(dot)stehule(at)gmail(dot)com>, "laurenz(dot)albe(at)cybertec(dot)at" <laurenz(dot)albe(at)cybertec(dot)at>, "peter(dot)eisentraut(at)enterprisedb(dot)com" <peter(dot)eisentraut(at)enterprisedb(dot)com>, "michael(at)paquier(dot)xyz" <michael(at)paquier(dot)xyz>, "horikyota(dot)ntt(at)gmail(dot)com" <horikyota(dot)ntt(at)gmail(dot)com>
Subject: Re: badly calculated width of emoji in psql
Date: 2021-08-25 17:13:08
Message-ID: CAFBsxsHhCJzn_50zhS=bBC5vEUxSzGR2vSVY9ZXHu3CUZQ+UDQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Aug 24, 2021 at 1:50 PM Jacob Champion <pchampion(at)vmware(dot)com> wrote:
>
> On Fri, 2021-08-20 at 13:05 -0400, John Naylor wrote:
> > On Thu, Aug 19, 2021 at 8:05 PM Jacob Champion <pchampion(at)vmware(dot)com>
wrote:
> > > I guess it just depends on what the end result looks/performs like.
> > > We'd save seven hops or so in the worst case?
> >
> > Something like that. Attached is what I had in mind (using real
> > patches to see what the CF bot thinks):
> >
> > 0001 is a simple renaming
> > 0002 puts the char width inside the mbinterval so we can put arbitrary
values there
>
> 0002 introduces a mixed declarations/statements warning for
> ucs_wcwidth(). Other than that, LGTM overall.

I didn't see that warning with clang 12, either with or without assertions,
but I do see it on gcc 11. Fixed, and pushed 0001 and 0002. I decided
against squashing them, since my original instinct was correct -- the
header changes too much for git to consider it the same file, which may
make archeology harder.

> > --- a/src/common/wchar.c
> > +++ b/src/common/wchar.c
> > @@ -583,9 +583,9 @@ pg_utf_mblen(const unsigned char *s)
> >
> > struct mbinterval
> > {
> > - unsigned short first;
> > - unsigned short last;
> > - signed short width;
> > + unsigned int first;
> > + unsigned int last:21;
> > + signed int width:4;
> > };
>
> Oh, right -- my patch moved mbinterval from short to int, but should I
> have used uint32 instead? It would only matter in theory for the
> `first` member now that the bitfields are there.

I'm not sure it would matter, but the usual type for codepoints is unsigned.

> > I think the adjustments to 0003 result in a cleaner and more
> > extensible design, but a case could be made otherwise. The former
> > combining table script is a bit more complex than the sum of its
> > former self and Jacob's proposed new script, but just slightly.
>
> The microbenchmark says it's also more performant, so +1 to your
> version.
>
> Does there need to be any sanity check for overlapping ranges between
> the combining and fullwidth sets? The Unicode data on a dev's machine
> would have to be broken somehow for that to happen, but it could
> potentially go undetected for a while if it did.

Thanks for testing again! The sanity check sounds like a good idea, so I'll
work on that and push soon.

--
John Naylor
EDB: http://www.enterprisedb.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2021-08-25 17:34:31 log_autovacuum in Postgres 14 -- ordering issue
Previous Message Justin Pryzby 2021-08-25 16:39:59 Re: [PATCH] document