From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Przemysław Sztoch <przemyslaw(at)sztoch(dot)pl> |
Cc: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com> |
Subject: | Re: [PATCH] Completed unaccent dictionary with many missing characters |
Date: | 2022-07-14 05:41:31 |
Message-ID: | Ys+siw2VEuyXdS4B@paquier.xyz |
Lists: | pgsql-hackers |
On Tue, Jul 05, 2022 at 09:24:49PM +0200, Przemysław Sztoch wrote:
> I do not add more, because they probably concern older languages.
> An alternative might be to rely entirely on Unicode decomposition ...
> However, after the change, only one additional Ukrainian letter with an
> accent was added to the rule file.
Hmm. I was wondering about the decomposition part, actually. How
much simpler would things get if we treated the full range of the
Cyrillic characters, i.e. from U+0400 to U+04FF, scanning all of them
and building rules only where there are decompositions? Is it worth
considering the Cyrillic Supplement as well, i.e. U+0500-U+052F?
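A minimal sketch of the scan described above, assuming Python's standard `unicodedata` module as the source of decomposition data. The helper name `decomposable` is hypothetical, and this is not the logic of the existing rule generator in contrib/unaccent; it only shows what "build a rule only if there is a decomposition" could look like:

```python
import unicodedata


def decomposable(first, last):
    """Yield (char, base) pairs for code points in [first, last] whose
    canonical decomposition is one base character plus combining marks.

    Hypothetical helper, for illustration only.
    """
    for cp in range(first, last + 1):
        ch = chr(cp)
        decomp = unicodedata.decomposition(ch)
        # Skip characters without a decomposition, and compatibility
        # decompositions, which are tagged like "<compat> ...".
        if not decomp or decomp.startswith("<"):
            continue
        parts = [chr(int(p, 16)) for p in decomp.split()]
        base, marks = parts[0], parts[1:]
        # Keep only "base letter + combining mark(s)" decompositions.
        if marks and all(unicodedata.combining(m) for m in marks):
            yield ch, base


# Cyrillic block and Cyrillic Supplement, as discussed above.
cyrillic = list(decomposable(0x0400, 0x04FF))
supplement = list(decomposable(0x0500, 0x052F))

if __name__ == "__main__":
    for ch, base in cyrillic + supplement:
        print(f"U+{ord(ch):04X} {ch} -> {base}")
```

Only one level of decomposition is followed here; a real generator would probably need to recurse on the base character.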
I was also thinking about the regression tests. As the characters that
unaccent has to handle are more spread out here than for Latin and
Greek, it would be a good thing to have complete coverage. We could for
example use a query like this one to check whether a character is treated properly or not:
SELECT chr(i.a) = unaccent(chr(i.a))
FROM generate_series(1024, 1327) AS i(a); -- U+0400..U+052F: Cyrillic and Cyrillic Supplement.
--
Michael