Re: badly calculated width of emoji in psql

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Jacob Champion <pchampion(at)vmware(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "pavel(dot)stehule(at)gmail(dot)com" <pavel(dot)stehule(at)gmail(dot)com>, "laurenz(dot)albe(at)cybertec(dot)at" <laurenz(dot)albe(at)cybertec(dot)at>, "peter(dot)eisentraut(at)enterprisedb(dot)com" <peter(dot)eisentraut(at)enterprisedb(dot)com>, "michael(at)paquier(dot)xyz" <michael(at)paquier(dot)xyz>, "horikyota(dot)ntt(at)gmail(dot)com" <horikyota(dot)ntt(at)gmail(dot)com>
Subject: Re: badly calculated width of emoji in psql
Date: 2021-08-25 20:15:34
Message-ID: CAFBsxsH5ejH4-1xaTLpSK8vWoK1m6fA1JBtTM6jmBsLfmDki1g@mail.gmail.com
Lists: pgsql-hackers

On Tue, Aug 24, 2021 at 1:50 PM Jacob Champion <pchampion(at)vmware(dot)com> wrote:
>
> Does there need to be any sanity check for overlapping ranges between
> the combining and fullwidth sets? The Unicode data on a dev's machine
> would have to be broken somehow for that to happen, but it could
> potentially go undetected for a while if it did.

It turns out I should have done that to begin with. In the Unicode data, it
apparently happens that a character can be both combining and wide, and
that will cause ranges to overlap in my scheme:

302A..302D;W # Mn [4] IDEOGRAPHIC LEVEL TONE MARK..IDEOGRAPHIC ENTERING TONE MARK

{0x3000, 0x303E, 2},
{0x302A, 0x302D, 0},

3099..309A;W # Mn [2] COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK..COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK

{0x3099, 0x309A, 0},
{0x3099, 0x30FF, 2},

Going by the above, Jacob's patch from July 21 just happened to be correct
by chance since the combining character search happened first.
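
For illustration, the kind of sanity check Jacob asked about might look
roughly like the sketch below. It is not taken from any posted patch: the
struct and function names are hypothetical, and it assumes a single merged
table sorted by starting code point, with first <= last in every entry.

```c
#include <stddef.h>

/* Hypothetical merged-table entry: inclusive code point range plus width. */
struct width_interval
{
    unsigned int first;
    unsigned int last;
    int          width;
};

/*
 * Return the index of the first entry that overlaps its predecessor,
 * or -1 if there is none.
 *
 * Assumption: the table is sorted by 'first' and each entry has
 * first <= last. Under that assumption, comparing each entry with its
 * immediate predecessor is enough to detect any overlap.
 */
int
find_overlap(const struct width_interval *table, size_t n)
{
    for (size_t i = 1; i < n; i++)
    {
        if (table[i].first <= table[i - 1].last)
            return (int) i;
    }
    return -1;
}
```

Given the two entries {0x3000, 0x303E, 2} and {0x302A, 0x302D, 0} from the
first excerpt above, this reports index 1.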

It seems the logical thing to do is revert my 0001 and 0002 and go back to
something much closer to Jacob's patch, plus a big comment explaining that
the order in which the searches happen matters.
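
To make the ordering point concrete, here is a minimal sketch of a lookup
with two separate tables. It is not the code from either patch: the names
are made up, the tables contain only the ranges quoted above rather than the
full generated data, and the special cases a real width function needs (NUL,
control characters) are left out.

```c
#include <stddef.h>

/* Inclusive code point range. */
struct interval
{
    unsigned int first;
    unsigned int last;
};

/* Excerpts only: just the ranges quoted in this message. */
static const struct interval combining[] = {
    {0x302A, 0x302D},
    {0x3099, 0x309A},
};

static const struct interval east_asian_wide[] = {
    {0x3000, 0x303E},
    {0x3099, 0x30FF},
};

/* Binary search over a sorted, non-overlapping table. */
static int
in_table(unsigned int ucs, const struct interval *table, size_t n)
{
    size_t lo = 0;
    size_t hi = n;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (ucs > table[mid].last)
            lo = mid + 1;
        else if (ucs < table[mid].first)
            hi = mid;
        else
            return 1;
    }
    return 0;
}

/* Hypothetical width lookup: 0 for combining, 2 for wide, 1 otherwise. */
int
sketch_dsplen(unsigned int ucs)
{
    /*
     * ORDER MATTERS: some characters (e.g. U+302A, U+3099) are both
     * combining and wide. Searching the combining table first makes
     * them zero-width; swapping the two tests would return 2 instead.
     */
    if (in_table(ucs, combining, sizeof(combining) / sizeof(combining[0])))
        return 0;
    if (in_table(ucs, east_asian_wide,
                 sizeof(east_asian_wide) / sizeof(east_asian_wide[0])))
        return 2;
    return 1;
}
```

Each table stays sorted and non-overlapping on its own, so the binary search
remains valid; the overlap only exists between the two tables and is
resolved by the order of the tests.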

EastAsianWidth.txt does have the combining property "Mn" in the comments
shown above, so it's tempting to just read that (then we would only need to
read one file for these properties). However, it seems risky to rely on
comments, since their presence and format are probably less stable than the
data format.
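
As a rough illustration of relying only on the data fields, the sketch below
parses one EastAsianWidth.txt-style line and discards everything from '#'
onward, so the "Mn" in the comment is never consulted. This is a hypothetical
C helper written for this message, not the actual generator script, and the
function name and buffer size are assumptions.

```c
#include <stdio.h>
#include <string.h>

/*
 * Parse "XXXX..YYYY;P" or "XXXX;P" from one line, ignoring any comment.
 * On success, store the inclusive range and the property (up to two
 * letters, e.g. "W" or "Na") and return 1. Return 0 for blank,
 * comment-only, or malformed lines.
 */
int
parse_eaw_line(const char *line, unsigned int *first, unsigned int *last,
               char prop[3])
{
    char  buf[256];
    char *hash;

    /* Work on a bounded copy so the comment can be cut off in place. */
    snprintf(buf, sizeof(buf), "%s", line);
    hash = strchr(buf, '#');
    if (hash != NULL)
        *hash = '\0';

    /* Range form: 302A..302D;W */
    if (sscanf(buf, "%x..%x ; %2[A-Za-z]", first, last, prop) == 3)
        return 1;

    /* Single code point form: 3000;F */
    if (sscanf(buf, "%x ; %2[A-Za-z]", first, prop) == 2)
    {
        *last = *first;
        return 1;
    }
    return 0;
}
```

The combining property would then have to come from a second data file
rather than from the comment text.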
--
John Naylor
EDB: http://www.enterprisedb.com
