Re: badly calculated width of emoji in psql

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Jacob Champion <pchampion(at)vmware(dot)com>, "pavel(dot)stehule(at)gmail(dot)com" <pavel(dot)stehule(at)gmail(dot)com>, "laurenz(dot)albe(at)cybertec(dot)at" <laurenz(dot)albe(at)cybertec(dot)at>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "horikyota(dot)ntt(at)gmail(dot)com" <horikyota(dot)ntt(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>
Subject: Re: badly calculated width of emoji in psql
Date: 2021-08-19 17:49:27
Message-ID: CAFBsxsEX8+K4fvvUEJYTqbhpj9ENenWwYm9SrMXZ1P2cPG8P+Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I wrote:

> On Sun, Aug 15, 2021 at 10:45 PM Michael Paquier <michael(at)paquier(dot)xyz>
wrote:
> >
> > On Thu, Aug 12, 2021 at 05:13:31PM -0400, John Naylor wrote:
> > > I'm a bit concerned about the build dependencies not working right,
but
> > > it's not clear it's even due to the patch. I'll spend some time
> > > investigating next week.
> >
> > How large do libpgcommon deliverables get with this patch? Skimming
> > through the patch, that just looks like a couple of bytes, still.
>
> More like a couple thousand bytes. That's because the width of mbinterval
doubled. If this is not desirable, we could teach the scripts to adjust the
width of the interval type depending on the largest character they saw.

For src/common/libpgcommon.a, in a non-assert / non-debug build:
master: 254912
patch: 256432

And if I go further and remove the limit on the largest character in the
combining table, I get 257248, which is still a relatively small difference.

I had a couple further thoughts:

1. The coding historically used normal comparison and branching for
everything, but recently it only does that to detect control characters,
and then goes through a binary search (and with this patch, two binary
searches) for everything else. Although the performance regression of the
current patch seems negligible, we could use almost the same branches to
fast-path printable ascii text, like this:

+ /* fast path for printable ASCII characters */
+ if (ucs >= 0x20 || ucs < 0x7f)
+ return 1;
+
/* test for 8-bit control characters */
if (ucs == 0)
return 0;

- if (ucs < 0x20 || (ucs >= 0x7f && ucs < 0xa0) || ucs > 0x0010ffff)
+ if (ucs < 0xa0 || ucs > 0x0010ffff)
return -1;

2. As written, the patch adds a script that's very close to an existing
one, and emits a new file that has the same type of contents as an existing
one, both of which are #included in one place. I wonder if we should
consider having just one script that ingests both files and emits one file.
All we need is for mbinterval to encode the character width, but we can
probably do that with a bitfield like the normprops table to save space.
Then, we only do one binary search. Opinions?

--
John Naylor
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2021-08-19 17:57:58 Re: reporting TID/table with corruption error
Previous Message Mark Dilger 2021-08-19 17:49:09 Re: reporting TID/table with corruption error