Re: daitch_mokotoff module

From: Dag Lem <dag(at)nimrod(dot)no>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: daitch_mokotoff module
Date: 2022-01-03 13:07:09
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Thomas Munro <thomas(dot)munro(at)gmail(dot)com> writes:
>> Erm, it looks like something weird is happening somewhere in cfbot's
>> pipeline, because Dag's patch says:
>> +SELECT daitch_mokotoff('Straßburg');
>> + daitch_mokotoff
>> +-----------------
>> + 294795
>> +(1 row)
> ... so, that test case is guaranteed to fail in non-UTF8 encodings,
> I suppose? I wonder what the LANG environment is in that cfbot
> instance.
> (We do have methods for dealing with non-ASCII test cases, but
> I can't see that this patch is using any of them.)
> regards, tom lane

I naively assumed that tests would be run in a UTF8 environment.

Running "ack -l '[\x80-\xff]'" in the contrib/ directory reveals that
two other modules use UTF8 characters in their tests: citext and
unaccent.

The citext tests appear to be commented out, marked "Multibyte sanity
tests. Uncomment to run."

Looking into the unaccent module, I don't quite understand how it will
work with various encodings, since it doesn't seem to decode its input -
will it fail if run under anything but ASCII or UTF8?

In any case, I see that unaccent.sql starts as follows:


-- must have a UTF8 database
SELECT getdatabaseencoding();

SET client_encoding TO 'UTF8';

Would doing the same thing in fuzzystrmatch.sql fix the problem with
failing tests? Should I prepare a new patch?
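For concreteness, here is a sketch of what such a guard could look like at
the top of the test script (the guard mirrors unaccent.sql verbatim; the
daitch_mokotoff call is the test case from the patch):

```sql
-- Sketch: guard the multibyte test cases the same way unaccent.sql does.
-- Echoing getdatabaseencoding() makes the regression diff self-explanatory
-- when the database was created with a non-UTF8 encoding.
SELECT getdatabaseencoding();

SET client_encoding TO 'UTF8';

-- Multibyte test case from the patch
SELECT daitch_mokotoff('Straßburg');
```

This only documents the UTF8 requirement in the expected output rather than
skipping the test, but it matches existing contrib practice.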

Best regards

Dag Lem
