Re: unaccent extension missing some accents

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: J Smith <dark(dot)panda+lists(at)gmail(dot)com>
Cc: Florian Pflug <fgp(at)phlo(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: unaccent extension missing some accents
Date: 2011-11-07 00:15:04
Message-ID: 27438.1320624904@sss.pgh.pa.us
Lists: pgsql-hackers

J Smith <dark(dot)panda+lists(at)gmail(dot)com> writes:
> I've attached a patch against master for unaccent.c that uses swscanf
> along with char2wchar and wchar2char instead of sscanf directly to
> initialize the unaccent extension and it appears to fix the problem in
> both the master and 9.1 branches.

swscanf doesn't seem like an acceptable approach: it's a function that
is relied on nowhere else in PG, so it adds new portability risks of its
own. It doesn't exist on some platforms that we support (like the one
I'm typing this message on) and there's no real good reason to assume
that it's not broken in its own ways on others.

If you really want to pursue this, I'd suggest parsing the line
manually, perhaps via strchr searches for \t and \n. It likely wouldn't
be very many more lines than what you've got here.
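
[Editor's illustration, not part of the original message.] A minimal
sketch of the manual parsing suggested above, assuming each rule line
has the form "source<TAB>target<NEWLINE>". The function name
split_rule_line and its return convention are hypothetical and are not
taken from unaccent.c; the point is only that strchr works on bytes and
never consults the locale.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: split "source\ttarget\n" in place.
 * On success, *src and *trg point into the (modified) line buffer
 * and the function returns 1. It returns 0 if there is no tab, or if
 * either field is empty. Only byte comparisons are used, so the
 * result does not depend on the locale's isspace().
 */
int
split_rule_line(char *line, char **src, char **trg)
{
    char *tab;
    char *end;

    assert(line != NULL && src != NULL && trg != NULL);

    /* The first tab separates the source from the target. */
    tab = strchr(line, '\t');
    if (tab == NULL || tab == line)
        return 0;
    *tab = '\0';

    /* The target runs up to the newline, or to end of string. */
    end = strchr(tab + 1, '\n');
    if (end != NULL)
        *end = '\0';
    if (tab[1] == '\0')
        return 0;

    *src = line;
    *trg = tab + 1;
    return 1;
}
```

A caller would pass each line read from the rules file in a writable
buffer and skip the line when the function returns 0. Real rules files
may separate the fields with other whitespace, which this sketch does
not handle.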

However, the bigger picture is that OS X's UTF8 locales are broken
through-and-through, and most of their other problems are not feasible
to work around. So basically you can't use them for anything
interesting, and it's not clear that it's worth putting any time into
solving individual problems. In the particular case here, the issue
presumably is that sscanf is relying on isspace() ... but we rely on
isspace() directly, in quite a lot of places, so how much is it going
to fix to dodge it right here?
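
[Editor's illustration, not part of the original message.] To make the
isspace() concern concrete: a locale-aware isspace() may classify some
high-bit bytes as whitespace, and such bytes can occur inside
multibyte UTF-8 characters. A sketch of a locale-independent
alternative follows. The names ascii_isspace and skip_ascii_space are
hypothetical, and, as the message points out, swapping this in at one
call site would leave every other direct isspace() call untouched.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: true only for the six ASCII whitespace
 * characters, regardless of the current locale.
 */
int
ascii_isspace(unsigned char c)
{
    return c == ' ' || c == '\t' || c == '\n' ||
           c == '\v' || c == '\f' || c == '\r';
}

/*
 * Hypothetical helper: advance past leading ASCII whitespace.
 * Bytes with the high bit set are never skipped, so multibyte
 * UTF-8 sequences are left intact.
 */
const char *
skip_ascii_space(const char *s)
{
    assert(s != NULL);
    while (*s != '\0' && ascii_isspace((unsigned char) *s))
        s++;
    return s;
}
```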

regards, tom lane
