Re: [PATCH] Completed unaccent dictionary with many missing characters

From: Przemysław Sztoch <przemyslaw(at)sztoch(dot)pl>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>
Subject: Re: [PATCH] Completed unaccent dictionary with many missing characters
Date: 2022-07-05 19:24:49
Message-ID: 4c9326a1-6554-262f-1f22-e636933086ed@sztoch.pl
Lists: pgsql-hackers

Michael Paquier wrote on 7/5/2022 9:22 AM:
> On Tue, Jun 28, 2022 at 02:14:53PM +0900, Michael Paquier wrote:
>> Well, the addition of cyrillic does not make necessary the removal of
>> SOUND RECORDING COPYRIGHT or the DEGREEs, that implies the use of a
>> dictionnary when manipulating the set of codepoints, but that's me
>> being too picky. Just to say that I am fine with what you are
>> proposing here.
> So, I have been looking at the change for cyrillic letters, and are
> you sure that the range of codepoints [U+0410,U+044f] is right when it
> comes to consider all those letters as plain letters? There are a
> couple of characters that itch me a bit with this range:
> - What of the letter CAPITAL SHORT I (U+0419) and SMALL SHORT I
> (U+0439)? Shouldn't U+0439 be translated to U+0438 and U+0419
> translated to U+0418? That's what I get while looking at
> UnicodeData.txt, and it would mean that the range of plain letters
> should not include both of them.
1. Good catch, I missed that. It does not affect the generated rule
list, though.
> - It seems like we are missing a couple of letters after U+044F, like
> U+0454, U+0456 or U+0455 just to name three of them?
2. I added a few more letters that are used in languages other than
Russian, such as Belarusian and Ukrainian.

-                       (0x0410, 0x044f),      # Cyrillic capital and small letters
+                       (0x0402, 0x0402),      # Cyrillic capital and small letters
+                       (0x0404, 0x0406),      #
+                       (0x0408, 0x040b),      #
+                       (0x040f, 0x0418),      #
+                       (0x041a, 0x0438),      #
+                       (0x043a, 0x044f),      #
+                       (0x0452, 0x0452),      #
+                       (0x0454, 0x0456),      #
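
To make the effect of the split ranges easier to check, here is a minimal
sketch of an inclusive range lookup. The names CYRILLIC_PLAIN_RANGES and
in_plain_ranges are hypothetical and are not taken from the patch; only the
range values come from the diff above.

```python
# Hypothetical helper, not code from the patch.
# The (first, last) pairs are copied from the diff; both ends are inclusive.
CYRILLIC_PLAIN_RANGES = (
    (0x0402, 0x0402),
    (0x0404, 0x0406),
    (0x0408, 0x040B),
    (0x040F, 0x0418),
    (0x041A, 0x0438),
    (0x043A, 0x044F),
    (0x0452, 0x0452),
    (0x0454, 0x0456),
)


def in_plain_ranges(codepoint, ranges=CYRILLIC_PLAIN_RANGES):
    """Return True if codepoint lies inside any inclusive (first, last) range."""
    return any(first <= codepoint <= last for first, last in ranges)


# SHORT I (U+0419, U+0439) should fall outside the ranges, while the
# neighbouring letters I (U+0418, U+0438) should stay inside.
for cp in (0x0418, 0x0419, 0x0438, 0x0439, 0x0454):
    print(f"U+{cp:04X} plain: {in_plain_ranges(cp)}")
```

With these ranges, U+0419 and U+0439 are no longer treated as plain letters,
which is the point raised in the review.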

I did not add more, because the remaining letters are probably used only
in older languages.
An alternative might be to rely entirely on Unicode decomposition ...
However, after this change, only one additional Ukrainian letter with an
accent was added to the rule file.
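
As a rough illustration of the decomposition-based alternative, the sketch
below uses Python's standard unicodedata module to look up the base letter
of a canonical decomposition. The function name base_letter is hypothetical,
and this is not how the patch generates its rules; it only shows what
UnicodeData.txt records for a few Cyrillic code points.

```python
import unicodedata


def base_letter(ch):
    """Return the first character of ch's canonical decomposition, or None.

    Compatibility decompositions (tagged like '<font>' or '<square>') and
    characters without any decomposition yield None.
    """
    decomposition = unicodedata.decomposition(ch)
    if not decomposition or decomposition.startswith("<"):
        return None
    first = decomposition.split()[0]
    return chr(int(first, 16))


# U+0419 and U+0439 (SHORT I) decompose to I plus a combining breve;
# U+0457 (YI) decomposes to U+0456 plus a combining diaeresis;
# U+0410 (A) has no decomposition at all.
for cp in (0x0419, 0x0439, 0x0457, 0x0410):
    base = base_letter(chr(cp))
    shown = f"U+{ord(base):04X}" if base else "none"
    print(f"U+{cp:04X} -> {shown}")
```

A lookup like this would classify a letter as "plain" exactly when it has no
canonical decomposition, instead of relying on hand-maintained ranges.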
>
> I have extracted from 0001 and applied the parts about the regression
> tests for degree signs, while adding two more for SOUND RECORDING
> COPYRIGHT (U+2117) and Black-Letter Capital H (U+210C) translated to
> 'x', while it should be probably 'H'.
3. The matter is not that simple. When I change the priorities (i.e.
make Latin-ASCII.xml less important than Unicode decomposition),
U+33D7 is no longer translated to "pH" but to "PH".
In the end, I left it as it was before ...
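
For reference, the decomposition side of this trade-off can be reproduced
with Python's standard unicodedata module. This is only a sketch of what
Unicode compatibility decomposition (NFKD) yields for the code points
discussed here; it does not model the Latin-ASCII.xml rules or the priority
logic of the generator script.

```python
import unicodedata

# Code points mentioned in this thread.
SAMPLES = {
    0x33D7: "SQUARE PH",
    0x210C: "BLACK-LETTER CAPITAL H",
    0x2117: "SOUND RECORDING COPYRIGHT",
}

for cp, label in SAMPLES.items():
    ch = chr(cp)
    nfkd = unicodedata.normalize("NFKD", ch)
    # A character without a decomposition comes back unchanged.
    result = nfkd if nfkd != ch else "(no decomposition)"
    print(f"U+{cp:04X} {label}: {result}")
```

Decomposition gives "PH" for U+33D7 and "H" for U+210C, while U+2117 has no
decomposition, so it can only be handled by a rule from another source.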

Once you decide what to do about point 3, I will make the correction and
send new patches.

--
Przemysław Sztoch | Mobile +48 509 99 00 66
