Re: unaccent extension missing some accents

From: J Smith <dark(dot)panda+lists(at)gmail(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: unaccent extension missing some accents
Date: 2011-11-06 23:43:22
Message-ID: CADFUPgeEw31kAoY3_9nH==uP9QesYKKTwLV_OgwVKM=P1VvnFg@mail.gmail.com
Lists: pgsql-hackers

On Sun, Nov 6, 2011 at 1:18 PM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
>
> What's the locale of the database you're seeing this in, and which charset
> does it use?
>
> I think scanf() uses isspace() and friends, and last time I looked the
> locale definitions were all pretty bogus on OSX. So maybe scanf() somehow
> decides that 0xA0 is whitespace.
>

Ah, that does it: my test code was running in the plain ol' C locale,
while the database uses en_CA.UTF-8. Switching the test code to the
UTF-8 locale reproduced the wonky results. I should have figured it
was a locale problem. Sure enough, in a UTF-8 locale, sscanf() treats
both 0xa0 and 0x85 as whitespace. Pretty wonky behaviour indeed.
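
For anyone who wants to check the isspace() theory on their own box,
here is a minimal sketch. The helper name byte_is_space_in_locale is
made up for illustration and is not part of unaccent.c; the only
library calls are the standard setlocale() and isspace(). It assumes
the named locale is installed and returns -1 if it is not.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from unaccent.c): report whether isspace()
 * classifies a single byte as whitespace under the named locale.
 * Returns 1 or 0, or -1 if the locale cannot be selected.
 */
static int
byte_is_space_in_locale(const char *locale_name, unsigned char byte)
{
    int result;

    assert(locale_name != NULL);

    /* Only character classification matters here, so set LC_CTYPE. */
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;

    result = isspace(byte) ? 1 : 0;

    /* Reset to the C locale so later calls start from a known state. */
    setlocale(LC_CTYPE, "C");
    return result;
}
```

Under the "C" locale this should return 0 for both 0xa0 and 0x85. On
an affected libc, passing "en_CA.UTF-8" would presumably return 1 for
both. Those bytes turn up as UTF-8 continuation bytes (for example,
"à" is 0xC3 0xA0 and "Å" is 0xC3 0x85), which would explain a
byte-oriented scan cutting accented characters in half.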

Apparently this is a known OSX issue that has its roots in an older
version of FreeBSD's libc, I guess, eh? I've found various bug reports
that allude to the problem, and they all seem to point that way.

I've attached a patch against master for unaccent.c that uses swscanf
along with char2wchar and wchar2char instead of sscanf directly to
initialize the unaccent extension and it appears to fix the problem in
both the master and 9.1 branches.
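
The general idea, sketched outside the backend: convert the line to
wide characters first, then let swscanf split it, so whitespace is
detected per character rather than per byte. This is only an
illustration and not the attached patch. It uses the standard
mbstowcs() as a stand-in for char2wchar, and the names
parse_rule_line and RULE_FIELD_LEN, the buffer sizes, and the
two-field format string are all assumptions.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Assumed capacity, in wide characters, of each output buffer. */
#define RULE_FIELD_LEN 256

/*
 * Hypothetical helper: split one rules line into a source and a target
 * field by scanning wide characters instead of raw bytes.
 * src and trg must each hold RULE_FIELD_LEN wide characters.
 * Returns the number of fields parsed (0, 1 or 2), or -1 if the line
 * is too long or cannot be decoded in the current LC_CTYPE locale.
 */
static int
parse_rule_line(const char *line, wchar_t *src, wchar_t *trg)
{
    wchar_t wline[1024];
    size_t  cap = sizeof(wline) / sizeof(wline[0]);
    size_t  n;
    int     fields;

    assert(line != NULL && src != NULL && trg != NULL);

    /* Decode multibyte input; n == cap means no room for the terminator. */
    n = mbstowcs(wline, line, cap);
    if (n == (size_t) -1 || n >= cap)
        return -1;

    /* Field width 255 leaves room for the terminating null. */
    fields = swscanf(wline, L"%255ls %255ls", src, trg);

    /* swscanf returns EOF on an empty line; report that as zero fields. */
    return fields < 0 ? 0 : fields;
}
```

With the process locale set to a UTF-8 locale, a line such as "à a"
is decoded into the characters U+00E0, space, and "a" before any
whitespace test runs, so the 0xA0 byte inside "à" is never examined
on its own.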

I haven't added any tests in the expected output file 'cause I'm not
exactly sure what I should be testing against, but I could take a
crack at that, too, if the patch looks reasonable and is usable.

Cheers.

Attachment: 0001-Fix-weirdness-when-dealing-with-UTF-8-in-buggy-libc-.patch (application/octet-stream, 1.3 KB)
