Re: BUG #18362: unaccent rules and Old Greek text

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: cees(dot)van(dot)zeeland(at)freedom(dot)nl, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18362: unaccent rules and Old Greek text
Date: 2024-02-25 23:19:53
Message-ID: ZdvLGeJ1BsXRkrdQ@paquier.xyz
Lists: pgsql-bugs

On Sun, Feb 25, 2024 at 04:21:36PM +1300, Thomas Munro wrote:
> On Sun, Feb 25, 2024 at 11:14 AM PG Bug reporting form
> <noreply(at)postgresql(dot)org> wrote:
>> So, there are reasons to keep the current unaccent.rules as it is, but...
>> there are other reasons to add a few lines to it, f.e. after line 955 and
>> insert five greek vowels with Oxia
>> Please add:
>> ά α
>> έ ε
>> ή η
>> ί ι
>> ό ο
>> ύ υ
>> ώ ω

Correct me if I'm wrong of course, but reading a bit on the matter at
[1], letters with Tonos or Oxia have been equivalent since 1986, and
we only include the characters with Tonos in our unaccent.rules.
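
As a quick way to check that equivalence, here is a minimal sketch using
Python's standard unicodedata module. It assumes that the seven vowels
requested above are the Greek Extended code points U+1F71..U+1F7D; the
names OXIA_TO_TONOS and nfc_matches_tonos are made up for illustration
and do not exist in the PostgreSQL tree.

```python
import unicodedata

# Assumed mapping: vowel with Oxia (Greek Extended block) -> vowel with
# Tonos (Greek and Coptic block).
OXIA_TO_TONOS = {
    "\u1f71": "\u03ac",  # alpha
    "\u1f73": "\u03ad",  # epsilon
    "\u1f75": "\u03ae",  # eta
    "\u1f77": "\u03af",  # iota
    "\u1f79": "\u03cc",  # omicron
    "\u1f7b": "\u03cd",  # upsilon
    "\u1f7d": "\u03ce",  # omega
}


def nfc_matches_tonos(mapping):
    """Return True if NFC normalization turns every Oxia form into its Tonos form."""
    return all(
        unicodedata.normalize("NFC", oxia) == tonos
        for oxia, tonos in mapping.items()
    )


if __name__ == "__main__":
    print(nfc_matches_tonos(OXIA_TO_TONOS))
```

If this holds, text that is NFC-normalized before it reaches unaccent()
would already be covered by the existing Tonos rules.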

> We don't exactly maintain this list manually, we extract it from
> Unicode source data. Can you see what needs to be adjusted in here to
> achieve that goal?

See commits like e3dd7c06e627 or 59f47fb98dab for some references.
Unfortunately, our policy has been to not backpatch any changes to the
in-core rules file, and you can plug in your own file. That said,
these additions sound natural seen from here.

> Perhaps a new range or something like that?

It seems to me that it is a bit more complicated than that, because
UnicodeData.txt decomposes the characters with Oxia into the characters
with Tonos, and the characters with Tonos are then decomposed into the
"base" alphabet characters + Tonos. We do a recursive lookup in the
Unicode table in get_plain_letter() and is_letter_with_marks(), so it
seems to me that we're not missing much, and I suspect that there
should be no need for a new custom range of characters.
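
To illustrate the two-step decomposition, here is a rough sketch that
follows canonical decompositions recursively. It relies on Python's
unicodedata module as a stand-in for parsing the Unicode data file, and
plain_letter() is a hypothetical helper written for this mail, not the
get_plain_letter() from the generator script.

```python
import unicodedata


def plain_letter(ch):
    """Follow canonical decompositions until reaching a character with none.

    Only the first code point of each decomposition is followed, on the
    assumption that it is the base letter and the rest are combining marks.
    """
    decomp = unicodedata.decomposition(ch)
    # An empty string means no decomposition; a leading "<tag>" marks a
    # compatibility decomposition, which this sketch does not follow.
    if not decomp or decomp.startswith("<"):
        return ch
    first = chr(int(decomp.split()[0], 16))
    return plain_letter(first)


if __name__ == "__main__":
    print(unicodedata.decomposition("\u1f71"))  # 03AC (alpha with Tonos)
    print(unicodedata.decomposition("\u03ac"))  # 03B1 0301 (alpha + combining acute)
    print(plain_letter("\u1f71") == "\u03b1")   # True
```

So a recursive lookup started from an Oxia character ends up on the same
base letter as the lookup started from its Tonos counterpart.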

Cees, perhaps you would like to take a shot at that?

[1]: https://en.wikipedia.org/wiki/Greek_diacritics#Unicode
--
Michael
