Re: BUG #15548: Unaccent does not remove combining diacritical characters

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, hugh(at)whtc(dot)ca, daniel(at)manitou-mail(dot)org, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2018-12-18 06:23:57
Message-ID: 11345.1545114237@sss.pgh.pa.us
Lists: pgsql-bugs pgsql-hackers

Michael Paquier <michael(at)paquier(dot)xyz> writes:
> On Tue, Dec 18, 2018 at 12:36:02AM -0500, Tom Lane wrote:
>> tl;dr: I think we should convert unaccent.sql and unaccent.out
>> to UTF8 encoding. Then, adding more test cases for this patch
>> will be easy.

> Do you think that we could also remove the non-ASCII characters from the
> tests? It would be easy enough to use E'\xNN' (utf8 hex) or such in
> input, and show the output with bytea.

I'm not really for that, because it would make the test cases harder
to verify by eyeball. With the current setup --- other than the
uncommon-outside-Russia encoding choice --- you don't really need
to read or speak Russian to see that this:

SELECT unaccent('ёлка');
 unaccent
----------
 елка
(1 row)

probably represents unaccent doing what it ought to. If everything
is in hex then it's a lot harder.
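
To make the comparison concrete, here is a minimal Python sketch of what the same test case looks like once the strings are reduced to UTF-8 hex. It does not call PostgreSQL or unaccent at all; the helper name utf8_hex and the bytea-style "\x" prefix are assumptions made purely for illustration, and the "expected" string is simply the one shown in the psql output above.

```python
# Illustrative only: this does not run unaccent, it just renders the
# readable test strings from the example above as UTF-8 hex.

readable_input = "ёлка"    # input from the psql example
readable_output = "елка"   # result shown in the psql example


def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as a bytea-style hex string.

    Hypothetical helper; the leading "\\x" mimics bytea hex output.
    """
    return "\\x" + text.encode("utf-8").hex()


hex_input = utf8_hex(readable_input)
hex_output = utf8_hex(readable_output)

print(hex_input)   # \xd191d0bbd0bad0b0
print(hex_output)  # \xd0b5d0bbd0bad0b0
```

The two hex strings differ only in their first two bytes (d191 versus d0b5), which is far less obvious at a glance than ё versus е.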

Ten years ago I might've agreed with your point, but today it's
hard to believe that anyone who takes any interest at all in
unaccent's functionality would not have a UTF8-capable terminal.

> That's harder to read. Still, the last time this was touched (5e8d670)
> we discussed not using UTF-8 in the Python script, so that folks with
> simple terminals could touch the code, and the characters used could be
> documented as comments in the tests.

Maybe I'm misremembering, but I thought that discussion was about the
code files. I am still mistrustful of non-ASCII in our code files.
But for data and test files, we've been accepting UTF8 ever since the
text-search-in-core stuff landed. Heck, unaccent.rules itself is UTF8.

regards, tom lane
