Re: Yet another problem with ILIKE and UTF-8

From: "Gergely Bor" <borg42(at)gmail(dot)com>
To: "Gregory Stark" <stark(at)enterprisedb(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: Yet another problem with ILIKE and UTF-8
Date: 2007-10-26 14:17:43
Message-ID: 396c97310710260717w38a1736y7d05d0bf15b9359e@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hello Gregory, hello all,

I've tested the DB created with hu_HU.UTF8 lc_* settings and it works
like a charm! We'll re-create the database and move the contents. It's
a nice improvement of 8.3 to detect and disallow such
misconfiguration.

------
Summing up the issue & fix (for googling):
- Version: postgresql 8.2.x (and possibly older ones as well)
- Symptom: UTF-8 values are stored correctly but some ILIKE queries
don't return results as expected, even when searching for ASCII
substrings.
- Reason: database misconfiguration -> UPPER()/LOWER() cannot handle
UTF-8 chars -> ILIKE cannot handle UTF-8 chars. The error above occurs
if the DB is created with UTF-8 internal encoding, but lc_* settings
are not referring UTF-8 locales. (Eg. bad locale reference: "hu_HU",
good locale reference: "hu_HU.UTF8".)
- Fix: a new database (or DB cluster?) should be created with the
corrected lc_* settings. If there's data already, it should be dumped
and restored (see steps below).
- Additional info: lc_* settings (and a whole lot of others) can be
displayed with the psql command "SHOW ALL".
- Additional info: in case the DB is configured incorrectly, psql 8.2
just fails to match rows correctly (without any errors), but 8.3+ will
reject this bad configuration in time.
------

Thanks a lot, guys!

Best regards,
Gergely BOR

On 10/25/07, Gregory Stark <stark(at)enterprisedb(dot)com> wrote:
>
> "Gergely Bor" <borg42(at)gmail(dot)com> writes:
>
> > We'll google the initdb stuff and try it ASAP.
> >
> > What I've tried is LOWER and UPPER, and they seem to return trash for
> > Hungarian UTF-8 characters, but they handle ASCII well. (Hmmmm...
> > maybe ILIKE requires LOWER and UPPER to work? Would not be
> > illogical...)
>
> It does. I think it works by just downcasing both strings. It's possible to do
> better but tricky. I think 8.3 has an optimization for that for single-byte
> encodings but it had to be disabled for utf-8 in the end.
>
> If it's returning trash for those characters then it's not prepared to handle
> UTF-8 data. You have to use an encoding compatible with your locale and
> vice-versa.
>
> If you want to store UTF-8 data I suggest you
>
> . add hu_HU.UTF-8 to /etc/locale.gen,
> . rerun /usr/sbin/locale-gen
> . pg_dump your database
> . re-initdb with the locale set to hu_HU.UTF-8
> . pg_restore your data.
>
> Unfortunately that'll take quite a while and involve down-time.
>
> You should probably do this in a second directory aside from your existing
> database just in case you've created any invalidly encoded utf-8 strings.
> You'll have to fix them before restoring. (Actually I don't recall which
> version got strict about that.)
>
> --
> Gregory Stark
> EnterpriseDB http://www.enterprisedb.com
>

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Tom Lane 2007-10-26 14:29:02 Re: Possible planner bug/regression introduced in 8.2.5
Previous Message Lee Packham 2007-10-26 14:08:21 Re: BUG #3699: Fails to compile DTrace Support