Re: Latest on CITEXT 2.0

From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latest on CITEXT 2.0
Date: 2008-06-26 17:09:37
Message-ID: 7998C08A-D40B-4081-A343-1EA1B3FA7976@kineticode.com
Lists: pgsql-hackers
On Jun 26, 2008, at 10:02, Tom Lane wrote:

> BTW, I don't think you can use that same-length optimization for
> citext.  There's no reason to think that upper/lowercase pairs will
> have the same length all the time in multibyte encodings.

I was wondering about that. I had been thinking of canonically
equivalent strings and combining marks. A quick test suggests that a
character written with a combining mark does not compare as equal to
its precomposed form. For example, this returns false:

   SELECT 'Ä'::text = 'Ä'::text;

At least with en_US.UTF-8. Hrm. It looks like my client makes them  
both canonical, so I've attached a script demonstrating this issue.
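
Since a client may normalize pasted text, here is a minimal sketch that spells the two forms out with explicit escapes. It uses Python's `unicodedata` module rather than PostgreSQL, so it only illustrates the code-point and byte-level difference; it makes no claim about what any particular server locale does. The choice of U+00C4 versus U+0041 followed by U+0308 is an assumption about which two forms were meant.

```python
import unicodedata

# Two canonically equivalent spellings of the same visible character.
precomposed = "\u00c4"   # LATIN CAPITAL LETTER A WITH DIAERESIS
decomposed = "A\u0308"   # LATIN CAPITAL LETTER A + COMBINING DIAERESIS

# A plain code-point comparison sees two different strings.
same_code_points = precomposed == decomposed

# After normalizing to NFC the two spellings become identical.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

# The UTF-8 byte lengths differ as well.
byte_lengths = (len(precomposed.encode("utf-8")),
                len(decomposed.encode("utf-8")))

print(same_code_points, same_after_nfc, byte_lengths)
```

A bytewise comparison of the two forms would therefore fail unless both sides are normalized first.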

Anyway, I was aware of different byte counts for canonical  
equivalence, but not for differences between upper- and lowercase  
characters. I'd certainly defer to your knowledge of how these things  
truly work in PostgreSQL, Tom, and can of course easily remove that  
optimization. So, are you certain about this?
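
As a rough check on the concern about case pairs, the sketch below compares UTF-8 byte lengths of a few characters and their case-mapped forms. It relies on Python's built-in Unicode case mappings, which is an assumption: PostgreSQL's `lower()` and `upper()` depend on the server's locale and may map these characters differently. The helper name `utf8_len` and the sample characters are chosen for illustration only.

```python
def utf8_len(s: str) -> int:
    """Number of bytes the string occupies when encoded as UTF-8."""
    return len(s.encode("utf-8"))

# (character, case-mapped form) according to Python's Unicode tables.
pairs = [
    ("\u0131", "\u0131".upper()),  # LATIN SMALL LETTER DOTLESS I -> I
    ("\u212a", "\u212a".lower()),  # KELVIN SIGN -> k
    ("\u023a", "\u023a".lower()),  # LATIN CAPITAL LETTER A WITH STROKE -> U+2C65
]

lengths = [(utf8_len(a), utf8_len(b)) for a, b in pairs]

for (a, b), (len_a, len_b) in zip(pairs, lengths):
    print(f"U+{ord(a):04X} ({len_a} bytes) <-> U+{ord(b):04X} ({len_b} bytes)")
```

Each pair differs in byte length, so under these mappings two values that differ only in case need not be the same length, and a same-length shortcut could reject strings that a case-insensitive comparison would treat as equal.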

Many thanks,

David


Attachment: try.sql
Description: application/octet-stream (34 bytes)
