Re: BUG #15548: Unaccent does not remove combining diacritical characters

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, hugh(at)whtc(dot)ca, daniel(at)manitou-mail(dot)org, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2018-12-18 05:36:02
Message-ID: 8506.1545111362@sss.pgh.pa.us
Lists: pgsql-bugs pgsql-hackers

Michael Paquier <michael(at)paquier(dot)xyz> writes:
> Could you also add some tests in contrib/unaccent/sql/unaccent.sql at
> the same time? That would be nice to check easily the extent of the
> patches proposed on this thread.

I wonder why unaccent.sql is set up to run its tests in KOI8 client
encoding rather than UTF8. It doesn't seem like it's the business
of this test script to be verifying transcoding from KOI8 to UTF8
(and if it were meant to do that, it's a pretty incomplete test...).
But having it set up like that means that we can't directly add
such tests to unaccent.sql, because there are no combining diacritics
in the KOI8 character set. We have two unattractive options:

* Change client encodings partway through unaccent.sql. I think this
would be disastrous for editability of that file; no common tools
will understand the encoding change.

* Put the new test cases into a separate file with a different client
encoding. This is workable, I suppose, but it seems pretty silly
when the tests are only a few queries apiece.
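The claim that KOI8 has no combining diacritics can be checked outside the database. The following is a minimal sketch, assuming that Python's `koi8_r` codec is a fair stand-in for the KOI8 client encoding used by the test script; the helper name `encodable` is made up for illustration. It tries to encode every code point of the Unicode Combining Diacritical Marks block (U+0300 to U+036F) in both encodings.

```python
def encodable(ch: str, encoding: str) -> bool:
    """Return True if the single character ch can be represented in encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Unicode "Combining Diacritical Marks" block: U+0300 .. U+036F.
combining = [chr(cp) for cp in range(0x0300, 0x0370)]

# Assumption: koi8_r approximates the KOI8 client encoding of the test script.
koi8_ok = [ch for ch in combining if encodable(ch, "koi8_r")]
utf8_ok = [ch for ch in combining if encodable(ch, "utf-8")]

print(len(koi8_ok), "of", len(combining), "combining marks fit in KOI8-R")
print(len(utf8_ok), "of", len(combining), "combining marks fit in UTF-8")
```

None of these code points has a KOI8-R representation, so a test query containing one cannot be written in a KOI8-encoded file at all.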

Another problem I've got with the current setup is that it seems
unlikely that many people's editors default to an assumption of
KOI8 encoding. Mine guesses that these files are UTF8, and so
the test cases look perfectly insane. They do make sense if
I transcode the files to UTF8, but I wonder why we're not shipping
them as UTF8 in the first place.
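For illustration only, here is a small sketch of both effects described above: KOI8 bytes misread as UTF8, and the same bytes after transcoding. It assumes Python's `koi8_r` codec matches the encoding of the files; the function name `transcode` and the sample word are hypothetical and are not taken from the test files.

```python
def transcode(data: bytes, src: str = "koi8_r", dst: str = "utf-8") -> bytes:
    """Decode bytes from the source encoding and re-encode them in the target."""
    return data.decode(src).encode(dst)


# Hypothetical sample text, stored the way a KOI8-R file would store it.
sample_koi8 = "ёлка".encode("koi8_r")

# What an editor sees if it guesses UTF-8: invalid sequences become U+FFFD.
misread = sample_koi8.decode("utf-8", errors="replace")

# After transcoding, the bytes are valid UTF-8 and decode to the original text.
sample_utf8 = transcode(sample_koi8)

print(repr(misread))
print(sample_utf8.decode("utf-8"))
```

The same `transcode` function could be applied to the full contents of a file read in binary mode; whether that is how the conversion should actually be done for the regression files is left open here.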

tl;dr: I think we should convert unaccent.sql and unaccent.out
to UTF8 encoding. Then, adding more test cases for this patch
will be easy.

regards, tom lane
