Re: BUG #15347: Unaccent for greek characters does not work

From: Tasos Maschalidis <TaS(dot)O(dot)S(at)hotmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #15347: Unaccent for greek characters does not work
Date: 2018-08-23 12:22:41
Message-ID: VI1PR01MB38537EBD529FE5EE3FE9A5FEB5370@VI1PR01MB3853.eurprd01.prod.exchangelabs.com
Lists: pgsql-bugs

Hi Thomas,

Your concerns are understandable, especially when Klingon is taken into consideration.

I am not familiar enough with Python to set up an environment, run the script, and check the result, but I am more than willing to review the results! If you need any more input on my part (as a native Greek speaker), please ask away!

If I understood correctly, I guess that to include the Greek characters the method would need to change to something like this:

    return (codepoint.id >= ord('a') and codepoint.id <= ord('z')) or \
           (codepoint.id >= ord('A') and codepoint.id <= ord('Z')) or \
           (codepoint.id >= ord('α') and codepoint.id <= ord('ω')) or \
           (codepoint.id >= ord('Α') and codepoint.id <= ord('Ω'))
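
To make the suggestion above easier to try out, here is a minimal, self-contained sketch. The `Codepoint` tuple is a hypothetical stand-in for whatever object the script passes in (only an `.id` attribute holding the code point number is assumed), and `check` is a helper added purely for illustration. Note that the lowercase range also covers the final sigma 'ς', while accented forms such as 'ά' fall outside both ranges, which seems to be what we want for a "plain letter" test.

```python
from collections import namedtuple

# Hypothetical stand-in for the script's code point objects; only the
# .id attribute (the numeric code point) is assumed here.
Codepoint = namedtuple("Codepoint", ["id"])


def is_plain_letter(codepoint):
    """Return true if codepoint is a plain ASCII or Greek letter (sketch)."""
    return (ord('a') <= codepoint.id <= ord('z')) or \
           (ord('A') <= codepoint.id <= ord('Z')) or \
           (ord('α') <= codepoint.id <= ord('ω')) or \
           (ord('Α') <= codepoint.id <= ord('Ω'))


def check(ch):
    """Illustrative helper: apply is_plain_letter to a one-character string."""
    return is_plain_letter(Codepoint(ord(ch)))


if __name__ == "__main__":
    for ch in "aZαωΑΩςάΆé":
        print(ch, check(ch))
```
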

Thanks,

Tasos Maschalidis

P.S.: This gist shows what the results should look like for Greek characters (lines 190-409).

________________________________
From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Sent: Thursday, August 23, 2018 8:22:21 AM
To: tas(dot)o(dot)s(at)hotmail(dot)com; PostgreSQL mailing lists
Subject: Re: BUG #15347: Unaccent for greek characters does not work

On Thu, Aug 23, 2018 at 3:08 AM, PG Bug reporting form
<noreply(at)postgresql(dot)org> wrote:
> The following bug has been logged on the website:
>
> Bug reference: 15347
> Logged by: Tasos Maschalidis
> Email address: tas(dot)o(dot)s(at)hotmail(dot)com
> PostgreSQL version: 9.3.18
> Operating system: Ubuntu 4.8.4
> Description:
>
> Calling the unaccent function with Greek characters does not return the Greek
> characters without the accents as expected (not even just the few diacritics
> used in modern Greek).

Hello Tasos,

Right. We generate the unaccent.rules file from the Unicode data file
using the Python script contrib/unaccent/generate_unaccent_rules.py in
the PostgreSQL source tree. The script currently limits itself to
Latin characters here:

def is_plain_letter(codepoint):
    """Return true if codepoint represents a plain ASCII letter."""
    return (codepoint.id >= ord('a') and codepoint.id <= ord('z')) or \
           (codepoint.id >= ord('A') and codepoint.id <= ord('Z'))

I was not brave enough to support other kinds of characters, because I
can't read 'em and check if the results are garbage (if you remove the
diacritics from Klingon, it might change the meaning of any word into
a declaration of war for all I know). If you know Python and would
like to have a go at modifying that script to support Greek, please
do! Otherwise perhaps I could try to do it and you could review the
results.
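
For anyone who wants a quick preview of what removing diacritics from Greek text might look like, the following is a rough sketch only. It does not use the generator script or the Unicode data file; instead it assumes Python's standard `unicodedata` module is an acceptable approximation, decomposing each character (NFD) and dropping the combining marks. The function name `strip_diacritics` is made up for this example, and the real rules file may well differ from what this produces.

```python
import unicodedata


def strip_diacritics(text):
    """Approximate unaccenting: decompose to NFD, drop combining marks.

    This is an illustration, not the logic of generate_unaccent_rules.py.
    """
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    for word in ["καλημέρα", "Ώρα", "ΐ"]:
        print(word, "->", strip_diacritics(word))
```

A native speaker would still need to confirm that the stripped forms are the ones people expect, which is the review step discussed above.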

There is a precedent already that it knows how to remove a diacritic
from at least one Cyrillic character. I think there is no reason at
all we shouldn't take a patch to support Greek or any other alphabet
that a native speaker can advise us on.

I think the chances of squeaking a change into PostgreSQL 11 are slim,
since it would require a special exception from the Release Management
Team at this point. Failing that, it'd be for PostgreSQL 12. We
don't usually back-patch unaccent.rules changes because they can
affect indexed data, and we don't want minor version upgrades to
break stuff.

[1] https://www.postgresql.org/message-id/CAEepm%3D1KRVinFtuDao4L%2BqSBh4T4k3z996EwD5-zgytu4Qa5Fw%40mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com
