From: | "Pavel Stehule" <pavel(dot)stehule(at)gmail(dot)com> |
---|---|
To: | "Andrew Dunstan" <andrew(at)dunslane(dot)net> |
Cc: | Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>, "PostgreSQL-development Hackers" <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: proposal: UTF8 to_ascii function |
Date: | 2008-08-11 14:49:23 |
Message-ID: | 162867790808110749t124533b5v49784a9204e58685@mail.gmail.com |
Lists: | pgsql-hackers |
2008/8/11 Andrew Dunstan <andrew(at)dunslane(dot)net>:
>
>
> Jan Urbański wrote:
>>
>> Andrew Dunstan wrote:
>>>
>>>
>>> Pavel Stehule wrote:
>>>>
>>>>
>>>> One note - convert_to is correct. But we have to use to_ascii without
>>>> decode functions. It has the same behavior - it converts from bytea to text.
>>>> Text in an "incorrect" encoding is de facto bytea. So the correct to_ascii
>>>> function prototypes are:
>>>>
>>>> to_ascii(text)
>>>> to_ascii(bytea, integer);
>>>> to_ascii(bytea, name);
>>>>
>>>>
>>>>>
>>>>>
>>>
>>> What you have not said is how you propose to convert UTF8 to ASCII.
>>>
>>> Currently to_ascii() converts a small number of single byte charsets to
>>> ASCII by folding the chars with high bits set, so what we get is a pure
>>> ASCII result which is safe in any server encoding, as they are all ASCII
>>> supersets.
>>>
>>> But what conversion rule will you use for the gazillions of Unicode
>>> characters?
>>>
>>> I honestly do not understand the use case for this at all.
>>
>> I do. Often clients want their searches to be insensitive to accents and
>> language-specific letters, so that searching for 'łódź' also
>> returns 'lodz'. So the use case is there (in fact, the lack of such a facility
>> made me consider not upgrading a particular client to 8.3...).
>> Or maybe there's a better way to do it?
>
> Well, my first question would be "Why aren't you using a database encoding
> that supports to_ascii()?"
>
> However, I suppose that your use case would support this signature:
>
> to_ascii(bytea, name)
>
> where it would just error out if the encoding name were something other than
> LATIN1, LATIN2, LATIN9, or WIN1250.
>
> But what would be the meaning of this?:
>
> to_ascii(bytea, integer)
>
It's symmetric. Nothing more.
>
> cheers
>
> andrew
>
>
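
To make the "what conversion rule?" question concrete, here is a rough sketch in Python of one possible rule. It is not how PostgreSQL's to_ascii() works (that uses fixed tables for a few single-byte charsets); it is only an illustration under stated assumptions. The names fold_to_ascii, to_ascii_bytes, MANUAL_FOLDS and SUPPORTED are hypothetical, the manual table is deliberately incomplete, and replacing unmappable characters with '?' is an arbitrary choice. It shows that Unicode decomposition handles 'ó' and 'ź' but not 'ł', which has no decomposition and needs an explicit rule.

```python
import unicodedata

# Letters with no Unicode decomposition need an explicit rule.
# Hypothetical, illustrative table; far from complete.
MANUAL_FOLDS = {
    "ł": "l", "Ł": "L",
    "ø": "o", "Ø": "O",
    "đ": "d", "Đ": "D",
    "ß": "ss",
}

# Encoding names mentioned in the thread, mapped to Python codec names.
SUPPORTED = {
    "LATIN1": "iso8859_1",
    "LATIN2": "iso8859_2",
    "LATIN9": "iso8859_15",
    "WIN1250": "cp1250",
}


def fold_to_ascii(text: str) -> str:
    """Fold a Unicode string to plain ASCII (assumed rule, see lead-in)."""
    out = []
    for ch in text:
        if ch in MANUAL_FOLDS:
            out.append(MANUAL_FOLDS[ch])
            continue
        # NFKD splits e.g. 'ó' into 'o' followed by a combining acute accent.
        for part in unicodedata.normalize("NFKD", ch):
            if unicodedata.combining(part):
                continue  # drop the accent mark itself
            # Anything still outside ASCII has no rule here; use a placeholder.
            out.append(part if ord(part) < 128 else "?")
    return "".join(out)


def to_ascii_bytes(data: bytes, encoding: str) -> str:
    """Sketch of a to_ascii(bytea, name) style call: error out on other encodings."""
    codec = SUPPORTED.get(encoding.upper())
    if codec is None:
        raise ValueError(f"encoding not supported by this sketch: {encoding}")
    return fold_to_ascii(data.decode(codec))
```

With this sketch, fold_to_ascii('łódź') gives 'lodz', and the same bytes encoded as LATIN2 give the same result through to_ascii_bytes. Characters with neither a decomposition nor a table entry come out as '?', which is exactly the part that needs a real policy decision.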