Re: [PATCH] Completed unaccent dictionary with many missing characters

From:	Michael Paquier <michael(at)paquier(dot)xyz>
To:	Przemysław Sztoch <przemyslaw(at)sztoch(dot)pl>
Cc:	Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>
Subject:	Re: [PATCH] Completed unaccent dictionary with many missing characters
Date:	2022-07-05 07:22:19
Message-ID:	YsPmq/1BMQryVT05@paquier.xyz
Lists:	pgsql-hackers

On Tue, Jun 28, 2022 at 02:14:53PM +0900, Michael Paquier wrote:
> Well, the addition of Cyrillic does not make necessary the removal of
> SOUND RECORDING COPYRIGHT or the DEGREEs, which implies the use of a
> dictionary when manipulating the set of codepoints, but that's me
> being too picky. Just to say that I am fine with what you are
> proposing here.

So, I have been looking at the change for Cyrillic letters. Are you
sure that the range of codepoints [U+0410,U+044F] is right when it
comes to treating all those letters as plain letters? A couple of
characters bother me a bit with this range:
- What of the letter CAPITAL SHORT I (U+0419) and SMALL SHORT I
(U+0439)? Shouldn't U+0439 be translated to U+0438 and U+0419
translated to U+0418? That's what I get while looking at
UnicodeData.txt, and it would mean that the range of plain letters
should not include both of them.
- It seems like we are missing a couple of letters after U+044F, like
U+0454, U+0456 or U+0455 just to name three of them?
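
The two points above can be checked against the Unicode decomposition
data. Below is a minimal sketch, assuming Python's standard unicodedata
module reflects the same decomposition field as UnicodeData.txt; the
helper names canonical_base and decomposable are hypothetical and are
not part of the patch or of the unaccent generation script.

```python
import unicodedata


def canonical_base(ch):
    """Return the first character of ch's canonical decomposition, or None.

    Compatibility decompositions (tagged like "<font>") are ignored here.
    """
    fields = unicodedata.decomposition(ch).split()
    if not fields or fields[0].startswith("<"):
        return None
    return chr(int(fields[0], 16))


def decomposable(first, last):
    """Map each codepoint in [first, last] that decomposes to its base codepoint."""
    result = {}
    for cp in range(first, last + 1):
        base = canonical_base(chr(cp))
        if base is not None:
            result[cp] = ord(base)
    return result


if __name__ == "__main__":
    # Codepoints in the proposed range, plus the letters right after it.
    for cp, base in decomposable(0x0410, 0x045F).items():
        print(f"U+{cp:04X} -> U+{base:04X}")
    # Letters with no decomposition are plain letters by this criterion.
    for cp in (0x0454, 0x0455, 0x0456):
        print(f"U+{cp:04X} plain: {canonical_base(chr(cp)) is None}")
```

Under that assumption, only U+0419 and U+0439 decompose within
[U+0410,U+044F], while U+0454, U+0455 and U+0456 carry no decomposition
and would count as plain letters.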

I have extracted from 0001 and applied the parts about the regression
tests for degree signs, while adding two more: one for SOUND RECORDING
COPYRIGHT (U+2117) and one for BLACK-LETTER CAPITAL H (U+210C), the
latter being translated to 'x' while it should probably be 'H'.
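
As a rough cross-check of the expected 'H', here is a small sketch using
Unicode compatibility normalization (NFKD) from Python's unicodedata
module. This is only an illustration of what the Unicode data says, not
how unaccent.rules is generated; compat_form is a hypothetical helper
name.

```python
import unicodedata


def compat_form(ch):
    """Return the NFKD (compatibility-decomposed) form of ch."""
    return unicodedata.normalize("NFKD", ch)


if __name__ == "__main__":
    for cp in (0x210C, 0x2117, 0x2103):
        ch = chr(cp)
        mapped = " ".join(f"U+{ord(c):04X}" for c in compat_form(ch))
        print(f"U+{cp:04X} {unicodedata.name(ch)} -> {mapped}")
```

With this check, U+210C maps to 'H', U+2117 has no decomposition and
maps to itself, and DEGREE CELSIUS (U+2103) maps to the degree sign
followed by 'C'.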
--
Michael

In response to

Re: [PATCH] Completed unaccent dictionary with many missing characters at 2022-06-28 05:14:53 from Michael Paquier

Responses

Re: [PATCH] Completed unaccent dictionary with many missing characters at 2022-07-05 19:24:49 from Przemysław Sztoch
