Re: Case Insensitive Queries

From: "tjk(at)tksoft(dot)com" <tjk(at)tksoft(dot)com>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us (Tom Lane)
Cc: tjk(at)tksoft(dot)com (tjk(at)tksoft(dot)com), danlyke(at)flutterby(dot)com (Dan Lyke), pgsql-sql(at)postgresql(dot)org
Subject: Re: Case Insensitive Queries
Date: 2001-05-30 14:37:45
Message-ID: 200105301437.HAA30076@smtp3.tksoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql


I use a few of them, and in my opinion there is a distinct group of
characters at last in the 8859-1 character set which have a lower and
upper case instance. The ranges are 0xC0 to 0xDD for upper case and 0xE0
to 0xFD for upper (with the exception of 0xD0, 0xD7, 0xF0, and 0xF7).
I haven't examined all the relevant docs, so I might be wrong. The lists
are based on my own observations of the characters in question.

There is probably no harm in sending a few extra bytes, so I am appending
a related function below. If someone finds a flaw with the function,
please tell me; that would be greatly appreciated.

I am also including a list of related characters. This email is going
out with 8859-1 as the charset, so I hope you are able to view them.

UPPER CASE:
192: À (0xc0) 193: Á (0xc1) 194: Â (0xc2) 195: Ã (0xc3)
196: Ä (0xc4) 197: Å (0xc5) 198: Æ (0xc6) 199: Ç (0xc7)
200: È (0xc8) 201: É (0xc9) 202: Ê (0xca) 203: Ë (0xcb)
204: Ì (0xcc) 205: Í (0xcd) 206: Î (0xce) 207: Ï (0xcf)
209: Ñ (0xd1) 210: Ò (0xd2) 211: Ó (0xd3)
212: Ô (0xd4) 213: Õ (0xd5) 214: Ö (0xd6)
216: Ø (0xd8) 217: Ù (0xd9) 218: Ú (0xda) 219: Û (0xdb)
220: Ü (0xdc) 221: Ý (0xdd)

LOWER CASE:
224: à (0xe0) 225: á (0xe1) 226: â (0xe2) 227: ã (0xe3)
228: ä (0xe4) 229: å (0xe5) 230: æ (0xe6) 231: ç (0xe7)
232: è (0xe8) 233: é (0xe9) 234: ê (0xea) 235: ë (0xeb)
236: ì (0xec) 237: í (0xed) 238: î (0xee) 239: ï (0xef)
241: ñ (0xf1) 242: ò (0xf2) 243: ó (0xf3)
244: ô (0xf4) 245: õ (0xf5) 246: ö (0xf6)
248: ø (0xf8) 249: ù (0xf9) 250: ú (0xfa) 251: û (0xfb)
252: ü (0xfc) 253: ý (0xfd)

SKIPPED
208: Ð (0xd0)
215: × (0xd7)
222: Þ (0xde)
240: ð (0xf0)
247: ÷ (0xf7)
254: þ (0xfe)

CREATE FUNCTION lower8859_1 (text) RETURNS text
AS '/usr/include/pgsql/lib/str8859_1.so'
LANGUAGE 'C';

/* No warranty of any kind, use at your own risk. Use freely.
*/

text * lower8859_1 (text * str1) {
text * result;
int32 len1 = 0, i;
unsigned char * p, * p2, c;
unsigned char upper_min = 0xC0;
unsigned char upper_max = 0xDD;

len1 = VARSIZE(str1) - VARHDRSZ;

if (len1 <= 0)
return str1;

result = (text *) palloc (len1 + 2 + VARHDRSZ);
if (! result)
return str1;

memset (result, 0, len1 + 2 + VARHDRSZ);

p = VARDATA(result);
p2 = VARDATA(str1);

for (i=0; i < len1; i++) {
c = p2[i];
if (isupper(c) || (c >= upper_min && c <= upper_max && c != 0xD0 && c != 0xD7))
p[i] = c + 0x20;
else
p[i] = c;
}

VARSIZE(result) = len1 + VARHDRSZ;

return result;
}

Troy


> "tjk(at)tksoft(dot)com" <tjk(at)tksoft(dot)com> writes:
> > If upper() and lower() operate on characters in 8859-1 and other character
> > sets when the appropriate locale is set, then a difference in the behavior
> > of upper() and lower() would seem like a bug.
>
> Au contraire ... upper() and lower() are not symmetric operations in
> quite a few non-English locales. I'll let those who regularly work with
> them give specific details, but handling of accents, German esstet (sp?),
> etc are the gotchas that I recall.
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/users-lounge/docs/faq.html
>

In response to

Browse pgsql-sql by date

  From Date Subject
Next Message Tom Lane 2001-05-30 14:39:34 Re: ERROR: Class '37632' not found
Previous Message Tom Lane 2001-05-30 14:23:54 Re: Case Insensitive Queries