Re: [PATCH] Completed unaccent dictionary with many missing characters

From: Przemysław Sztoch <przemyslaw(at)sztoch(dot)pl>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>
Subject: Re: [PATCH] Completed unaccent dictionary with many missing characters
Date: 2022-06-20 08:37:57
Message-ID: d0e93206-d7cd-37b8-a79b-7ddf73457e02@sztoch.pl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Michael Paquier wrote on 20.06.2022 03:49:
> On Wed, Jun 15, 2022 at 01:01:37PM +0200, Przemysław Sztoch wrote:
>> Two fixes (bad comment and fixed Latin-ASCII.xml).
> if codepoint.general_category.startswith('L') and \
> - len(codepoint.combining_ids) > 1:
> + len(codepoint.combining_ids) > 0:
> So, this one checks for the case where a codepoint is within the
> letter category. As far as I can see this indeed adds a couple of
> characters, with a combination of Greek and Latin letters. So that
> looks fine.
Previously, there were only multi-letter conversions. Now we also have
single letters.
>
> + elif codepoint.general_category.startswith('N') and \
> + len(codepoint.combining_ids) > 0 and \
> + args.noLigaturesExpansion is False and is_ligature(codepoint, table):
> + charactersSet.add((codepoint.id,
> + "".join(chr(combining_codepoint.id)
> + for combining_codepoint
> + in get_plain_letters(codepoint, table))))
> And this one is for the numerical part of the change. Do you actually
> need to apply is_ligature() here? I would have thought that this only
> applies to letters.
But ligature check is performed on combining_ids (result of
translation), not on base codepoint.
Without it, you will get assertions in get_plain_letters.

The idea is that we take translations that turn into normal letters.
Others (strange) are rejected.
Maybe it could be done better. I didn't like it as much as you did, but
I couldn't do better.
In the end, I left it just like in the original script.

Note that the plain letter list (PLAIN_LETTER_RANGES) has now been
expanded with numbers.
> - assert(False)
> + assert False, 'Codepoint U+%0.2X' % codepoint.id
> [...]
> - assert(is_ligature(codepoint, table))
> + assert is_ligature(codepoint, table), 'Codepoint U+%0.2X' % codepoint.id
> These two are a good idea for debugging.
>
> - return all(is_letter(table[i], table) for i in codepoint.combining_ids)
> + return all(i in table and is_letter(table[i], table) for i in codepoint.combining_ids)
> It looks like this makes the code weaker, as we would silently skip
> characters that are not part of the table rather than checking for
> them all the time?
Unfortunately, there are entries in combining_ids that are not in the
character table being used.
This protection is necessary so that there is no error. But unfamiliar
characters are omitted.
> While recreating unaccent.rules with your patch, I have noticed what
> looks like an error. An extra rule mapping U+210C (black-letter
> capital h) to "x" gets added on top of te existing one for "H", but
> the correct answer is the existing rule, not the one added by the
> patch.
The problem with the sign of U+210C is that there are conflicting
translations for it.
As the name suggests "(black-letter capital h)", it should be converted
to a capital H.
However, the current Latin-ASCII.xml suggests a conversion to x.
I found an open discussion on the internet about this and the suggestion
that the Latin-ASCII.xml file should be corrected for this letter.
But I wouldn't expect that Unicode makes the revised Latin-ASCII.xml
quickly into the official repo.

--
Przemysław Sztoch | Mobile +48 509 99 00 66

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Matthias van de Meent 2022-06-20 09:01:51 Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths
Previous Message Michael Paquier 2022-06-20 07:13:43 Re: [BUG] Panic due to incorrect missingContrecPtr after promotion