Re: BUG #15548: Unaccent does not remove combining diacritical characters

From: Hugh Ranalli <hugh(at)whtc(dot)ca>
To: Ramanarayana <raam(dot)soft(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2019-02-17 00:51:08
Message-ID: CAAhbUMOieimkZrCjpw2vQJ-k3p_jzzNsimdi0aq7dwTvKy2isA@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Mon, 11 Feb 2019 at 15:57, Ramanarayana <raam(dot)soft(at)gmail(dot)com> wrote:

> Hi Hugh,
>
> I tested the script in Python 2.7 and it works perfectly. The problem is in
> Python 3.7 (and maybe only on Windows, as you were not getting the issue),
> where I was getting the following error:
>
> UnicodeEncodeError: 'charmap' codec can't encode character '\u0100' in
> position 0: character maps to <undefined>
>
> I went through the python script and found that the stdout encoding is
> set to utf-8 only if python version is <=2.
>
> I have made the same change for Python version 3 as well. Please find the
> patch for the same. Let me know if it makes sense.
>
> Regards,
> Ram
>

Hi Ram,
I took a look at this, and unfortunately the proposed fix breaks Python 2
(sys.stdout.encoding isn't a writable attribute in Python 2) :-(. I've
attached a patch which is compatible with both versions. I have confirmed
that the output is identical across Python 2 and 3 and, once the difference
in line endings is accounted for, across Windows and Linux.
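
For readers without the attachment, here is a minimal sketch of one way to
get UTF-8 output on both Python versions without assigning to
sys.stdout.encoding. It is not copied from the patch; the helper names
utf8_writer and utf8_stdout are hypothetical, and the approach (wrapping the
byte stream with codecs.getwriter) is an assumption about what a portable
fix can look like.

```python
import codecs
import io
import sys


def utf8_writer(byte_stream):
    """Wrap a byte-oriented stream so text written to it is encoded as UTF-8."""
    return codecs.getwriter("utf8")(byte_stream)


def utf8_stdout():
    """Return a UTF-8 writer for stdout; sys.stdout.encoding is never assigned."""
    if sys.version_info[0] <= 2:
        # Python 2: sys.stdout accepts bytes directly.
        return utf8_writer(sys.stdout)
    # Python 3: sys.stdout is a text stream; .buffer is the underlying byte stream.
    return utf8_writer(sys.stdout.buffer)


# Demonstration on an in-memory stream, using the character from the traceback.
demo = io.BytesIO()
utf8_writer(demo).write(u"\u0100")
# demo.getvalue() now holds the two UTF-8 bytes b"\xc4\x80"

if __name__ == "__main__":
    out = utf8_stdout()
    out.write(u"\u0100\n")
```

Because the writer encodes explicitly, the result should not depend on the
console code page, which is what triggered the 'charmap' error on Windows.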

I've also opened the Unicode data file in UTF-8 and added a "with" block
which ensures we close the file when we are done with it. The change makes
the Python 2 compatibility a little more complex (there are now 2 blocks to
remove when Python 2 support is dropped), but it's the cleanest I could
achieve.
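
As an illustration of that second change, the sketch below reads a
UnicodeData.txt-style file as UTF-8 inside a "with" block. It is an
assumption-laden example, not the patch itself: the function name
read_unicode_data is hypothetical, and io.open is just one way to get the
same behaviour on Python 2 and 3.

```python
import io


def read_unicode_data(path):
    """Return (codepoint, name, general_category) tuples from a
    semicolon-separated UnicodeData.txt-style file."""
    entries = []
    # io.open takes an explicit encoding on both Python 2 and 3, so the
    # platform's default encoding (e.g. a Windows code page) is not used.
    with io.open(path, mode="r", encoding="utf-8") as data_file:
        for line in data_file:
            fields = line.rstrip("\n").split(";")
            if len(fields) < 3:
                continue  # skip blank or malformed lines
            entries.append((int(fields[0], 16), fields[1], fields[2]))
    # Leaving the "with" block closes the file, even if parsing raised.
    return entries
```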

The attached patch goes on top of patch 02 (not on top of the broken,
committed 03). I'm hoping that's not a problem. If it is, let me know and
I'll factor out the changes.

Please let me know if you have any questions.

Best wishes,
Hugh

Attachment: generate_unaccent_rules-remove-combining-diacritical-accents-04.patch (text/x-patch, 6.8 KB)
