Re: unaccent extension missing some accents

From: J Smith <dark(dot)panda+lists(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Florian Pflug <fgp(at)phlo(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: unaccent extension missing some accents
Date: 2011-11-07 16:46:46
Message-ID: CADFUPgeUqK3qqUkV=8H85UXcLMmKq7oHtm4tAkpf2n16Xsk0MQ@mail.gmail.com
Lists: pgsql-hackers

On Mon, Nov 7, 2011 at 11:12 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I looked at this a bit and realized that sscanf is actually doing a
> couple of critical things for us, which are lost in translation when
> doing it like this:
>
> 1. It ignores whitespace other than the dividing tab.  If we don't
> continue to do that, we'll likely break existing config files.
>
> 2. It ensures that src and trg each consist of at least one (nonblank)
> character.  placeChar() is critically dependent on the assumption that
> src is not empty.
>
> However, after looking around a bit at the other tsearch config-file-
> reading functions, I noted that they all use t_isspace() to identify
> whitespace ... and that function in fact should be okay on OS X,
> because it uses iswspace in multibyte encodings.
>
> So it's fairly simple to improve this code to reject whitespace that
> way.  I don't like the existing code anyway because of its potential
> vulnerability to buffer overrun.  I'll fix it up and commit.
>
>> As for the other problems with isspace and such on OSX, it might be
>> worth looking at the python portability fixes.
>
> If OS X's UTF8 locales weren't so thoroughly broken (eg sorting does not
> work), I might be tempted to try to do it that way, but I still fail
> to see the point.  After reviewing the code I feel that unaccent needs
> to be fixed because it's not consistent with the other tsearch config
> file parsers, and not so much because it works or doesn't work on any
> specific platform.
>

Yeah, I never knew there was such a problem with OS X and UTF8 before
running into it here, but it's good to know. When I noticed the
unaccent extension in more recent PostgreSQL versions, I figured it
would perform better than our current plperl-based accent-stripping
function (which it surely does). I then noticed that the results on my
machine were a little off, while our Linux-based servers were fine and
dandy.

Anyways, lemme know if there's anything else I could help with or
could test and whatnot. Cheers.
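
For anyone following along, here is a rough sketch of the two requirements
Tom lists above: skip whitespace around the fields, and insist that both
src and trg are non-empty. It is not the code that was committed. The names
parse_rule and is_space are hypothetical, plain isspace() stands in for
PostgreSQL's t_isspace() so the sketch compiles on its own, and rejecting
trailing text after trg is a choice made here, not something taken from the
thread. The explicit length check is there to avoid the buffer overrun
concern mentioned above.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for PostgreSQL's t_isspace(); using isspace() is an assumption
 * made only so that this sketch is self-contained. */
static bool
is_space(const char *p)
{
    return isspace((unsigned char) *p) != 0;
}

/*
 * Hypothetical helper: split "line" into two whitespace-delimited tokens.
 * Returns true only if both tokens are non-empty, each fits in its buffer
 * (bufsize bytes, including the terminating NUL), and nothing but
 * whitespace follows the second token.
 */
static bool
parse_rule(const char *line, char *src, char *trg, size_t bufsize)
{
    char       *out[2] = {src, trg};
    int         i;

    for (i = 0; i < 2; i++)
    {
        const char *start;
        size_t      len;

        /* skip leading or separating whitespace */
        while (*line && is_space(line))
            line++;
        start = line;
        /* consume one token */
        while (*line && !is_space(line))
            line++;
        len = (size_t) (line - start);
        /* reject empty tokens and tokens that would overrun the buffer */
        if (len == 0 || len >= bufsize)
            return false;
        memcpy(out[i], start, len);
        out[i][len] = '\0';
    }

    /* allow trailing whitespace (including the newline), nothing else */
    while (*line && is_space(line))
        line++;
    return *line == '\0';
}
```

A caller would skip any line for which parse_rule() returns false, so an
empty src never reaches the code that builds the lookup structure.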
