From: | Joseph Adams <joeyadams3(dot)14159(at)gmail(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: patch: utf8_to_unicode (trivial) |
Date: | 2010-08-13 07:12:44 |
Message-ID: | AANLkTin2x3OaKFZXNpMR+Z3WBDA_3d5QNp_dRYF4JzOJ@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, Jul 27, 2010 at 1:31 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Sat, Jul 24, 2010 at 10:34 PM, Joseph Adams
> <joeyadams3(dot)14159(at)gmail(dot)com> wrote:
>> In src/include/mb/pg_wchar.h , there is a function unicode_to_utf8 ,
>> but no corresponding utf8_to_unicode . However, there is a static
>> function called utf2ucs that does what utf8_to_unicode would do.
>>
>> I'd like this function to be available because the JSON code needs to
>> convert UTF-8 to and from Unicode codepoints, and I'm currently using
>> a separate UTF-8 to codepoint function for that.
>>
>> This patch renames utf2ucs to utf8_to_unicode and makes it public. It
>> also fixes the version of utf2ucs in src/bin/psql/mbprint.c so that
>> it's equivalent to the one in wchar.c .
>>
>> This is a patch against CVS HEAD for application. It compiles and
>> tests successfully.
>>
>> Comments? Thanks,
>
> I feel obliged to respond to this since I'm supposed to be covering your
> GSoC project while Magnus is on vacation, but I actually know very
> little about this topic. What's undeniable, however, is that the
> coding in the two versions of utf2ucs() in the tree right now doesn't
> match. src/backend/utils/mb/wchar.c has:
>
> else if ((*c & 0xf8) == 0xf0)
>
> while src/bin/psql/mbprint.c, which is otherwise identical, has:
>
> else if ((*c & 0xf0) == 0xf0)
>
> I'm inclined to believe that your patch is right to think that the
> former version is correct, because it used to match the latter version
> until Tom Lane changed it in 2007, and I suspect he simply failed to
> update both copies. But I'd like someone who actually understands
> what this code is doing to confirm that.
>
> http://archives.postgresql.org/pgsql-committers/2007-01/msg00293.php
>
> I suspect we need to not only fix this, but back-patch it at least to
> 8.2, which is the first release where there are two copies of this
> function. I am not sure whether earlier releases need to be changed,
> or not. But again, someone who understands the issues better than I
> do needs to weigh in here.
>
> In terms of making this function non-static, I'm inclined to think
> that a better approach would be to move it to src/port. That gets rid
> of the need to have two copies in the first place.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise Postgres Company
>
I've attached another patch that moves utf8_to_unicode to src/port per
Robert Haas's suggestion.

This patch itself is not quite as elegant as the first one because it
puts platform-independent code that "belongs" in wchar.c into src/port.
It also uses unsigned int instead of pg_wchar because the typedef of
pg_wchar isn't available to the frontend, if I'm not mistaken.

I'm not sure whether I like the old patch better or the new one.
Joey Adams
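
For readers following the mask discussion quoted above, here is a minimal sketch of a UTF-8 to code point decoder and of the two lead-byte tests being compared. It is not the code from the attached patch: the names sketch_utf8_to_unicode, strict_is_4byte_lead and loose_is_4byte_lead are hypothetical, the input is assumed to be an already validated sequence, and returning 0xffffffff for an unrecognized lead byte is a choice made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the function discussed in this thread.
 * Decodes one UTF-8 sequence of 1 to 4 bytes into a code point.
 * Assumes the caller has already validated the sequence.
 */
unsigned int
sketch_utf8_to_unicode(const unsigned char *c)
{
    assert(c != NULL);

    if ((*c & 0x80) == 0)               /* 0xxxxxxx: 1 byte (ASCII) */
        return (unsigned int) c[0];
    else if ((*c & 0xe0) == 0xc0)       /* 110xxxxx: 2 bytes */
        return (unsigned int) (((c[0] & 0x1f) << 6) |
                               (c[1] & 0x3f));
    else if ((*c & 0xf0) == 0xe0)       /* 1110xxxx: 3 bytes */
        return (unsigned int) (((c[0] & 0x0f) << 12) |
                               ((c[1] & 0x3f) << 6) |
                               (c[2] & 0x3f));
    else if ((*c & 0xf8) == 0xf0)       /* 11110xxx: 4 bytes */
        return (unsigned int) (((c[0] & 0x07) << 18) |
                               ((c[1] & 0x3f) << 12) |
                               ((c[2] & 0x3f) << 6) |
                               (c[3] & 0x3f));
    else
        return 0xffffffffu;             /* sketch's marker for a bad lead byte */
}

/* Test quoted from wchar.c: matches only 0xf0..0xf7 (11110xxx). */
int
strict_is_4byte_lead(unsigned char b)
{
    return (b & 0xf8) == 0xf0;
}

/* Test quoted from mbprint.c: matches every byte in 0xf0..0xff. */
int
loose_is_4byte_lead(unsigned char b)
{
    return (b & 0xf0) == 0xf0;
}
```

The two tests agree on 0xf0 through 0xf7 and differ only on 0xf8 through 0xff, which are never valid UTF-8 lead bytes; the looser mask would decode such a byte as if it began a four-byte sequence.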
Attachment | Content-Type | Size |
---|---|---|
utf8-to-unicode-port.patch | application/octet-stream | 5.4 KB |