
Re: Radix tree for character conversion

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: robertmhaas(at)gmail(dot)com, tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, ishii(at)sraoss(dot)co(dot)jp, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Radix tree for character conversion
Date: 2016-10-25 09:23:48
Message-ID: 08e7892a-d55c-eefe-76e6-7910bc8dd1f3@iki.fi
Lists: pgsql-hackers

On 10/21/2016 11:33 AM, Kyotaro HORIGUCHI wrote:
> Hello, this is new version of radix charconv.
>
> At Sat, 8 Oct 2016 00:37:28 +0300, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote in <6d85d710-9554-a928-29ff-b2d3b80b01c9(at)iki(dot)fi>
>> What I don't want is that the current *.map files are turned into the
>> authoritative source files, that we modify by hand. There are no
>> comments in them, for starters, which makes hand-editing
>> cumbersome. It seems that we have edited some of them by hand already,
>> but we should rectify that.
>
> Agreed. So, I identified source files of each character for EUC_JP
> and SJIS conversions to clarify what has been done on them.
>
> SJIS conversion is made from CP932.TXT and 8 additional
> conversions for UTF8->SJIS and none for SJIS->UTF8.
>
> EUC_JP is made from CP932.TXT and JIS0212.TXT. JIS0201.TXT and
> JIS0208.TXT are useless. It adds 83 or 86 (different by
> direction) conversion entries.
>
> http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
> http://unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0212.TXT
>
> Now the generator scripts don't use *.map as source and in turn
> generates old-style map files as well as radix tree files.
>
> For convenience, UCS_to_(SJIS|EUC_JP).pl takes the parameters --flat and
> -v. The former generates the old-style flat map as well as the radix
> map file, and an additional -v adds a source description for each line
> in the flat map file.
>
> While working on this, I noticed that the EUC_JP map lacks some
> conversions, but that is another issue.

Thanks!

I'd really like to clean up all the current perl scripts, before we
start to do the radix tree stuff. I worked through the rest of the
conversions, and fixed/hacked the perl scripts so that they faithfully
reproduce the mapping tables that we have in the repository currently.
Whether those are the best mappings or not, or whether we should update 
them based on some authoritative source is another question, but let's 
try to nail down the process of creating the mapping tables.

Tom Lane looked into this in Nov 2015 
(https://www.postgresql.org/message-id/28825.1449076551%40sss.pgh.pa.us). 
This is a continuation of that, to actually fix the scripts. This patch 
series doesn't change any of the mappings, only the way we produce the 
mapping tables.

Our UHC conversion tables contained a lot more characters than the 
CP949.TXT file it's supposedly based on. I rewrote the script to use 
"windows-949-2000.xml" file, from the ICU project, as the source 
instead. It's a much closer match to our mapping tables, containing all 
but one of the additional characters. We were already using 
gb-18030-2000.xml as the source in UCS_GB18030.pl, so parsing ICU's XML 
files isn't a new thing.
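
To illustrate the kind of parsing and comparison involved, here is a
minimal sketch in Python (not the Perl code from the patch). It assumes
the ICU XML files list each mapping as an <a u="..." b="..."/> element,
with the Unicode code point in "u" and space-separated hex bytes in "b".
The SAMPLE string and the function names are hypothetical stand-ins, not
content from windows-949-2000.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline stand-in for an ICU mapping file (assumed layout).
SAMPLE = """<characterMapping id="example" version="1">
  <assignments sub="3F">
    <a u="0041" b="41"/>
    <a u="AC00" b="B0 A1"/>
    <a u="AC02" b="81 41"/>
  </assignments>
</characterMapping>"""


def parse_icu_mapping(xml_text):
    """Return (code, ucs) integer pairs for every <a> element."""
    root = ET.fromstring(xml_text)
    pairs = []
    for elem in root.iter("a"):
        ucs = int(elem.get("u"), 16)
        # "B0 A1" -> 0xB0A1: join the bytes into one big-endian integer.
        code = int(elem.get("b").replace(" ", ""), 16)
        pairs.append((code, ucs))
    return pairs


def missing_from_source(ours, source):
    """Pairs present in our table but absent from the source file."""
    return sorted(set(ours) - set(source))


if __name__ == "__main__":
    source = parse_icu_mapping(SAMPLE)
    # Illustrative "existing table" with one entry the sample lacks.
    ours = [(0x8141, 0xAC02), (0x8142, 0xAC03)]
    for code, ucs in missing_from_source(ours, source):
        print(f"not in source: 0x{code:04x} -> U+{ucs:04X}")
```

Running the same difference against the real mapping table and the real
source file is one way to see how close a candidate source is.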

The GB2312.TXT source file seems to have disappeared from the Unicode 
consortium's FTP site. I changed the UCS_to_EUC_CN.pl script to use 
gb-18030-2000.xml as the source instead. GB-18030 is an extension of
GB-2312, so UCS_to_EUC_CN.pl filters out the additional characters that
are not in GB-2312.
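
A rough sketch of such a filter, in Python rather than Perl, and not the
actual code in UCS_to_EUC_CN.pl: it assumes a code belongs to GB-2312
when it is two bytes long, its lead byte is in rows 0xA1-0xA9 or
0xB0-0xF7, and its trail byte is in 0xA1-0xFE. That range test is only
an approximation, since GB-18030 also assigns some cells inside those
rows that GB-2312 leaves empty. The function names are hypothetical.

```python
def is_gb2312_code(code):
    """Approximate test: is this multibyte code inside the GB-2312 rows?"""
    if code > 0xFFFF:
        return False  # four-byte GB-18030 sequences are never GB-2312
    lead, trail = code >> 8, code & 0xFF
    in_rows = 0xA1 <= lead <= 0xA9 or 0xB0 <= lead <= 0xF7
    return in_rows and 0xA1 <= trail <= 0xFE


def filter_gb2312(pairs):
    """Keep only the (code, ucs) pairs that pass the range test."""
    return [(code, ucs) for code, ucs in pairs if is_gb2312_code(code)]


if __name__ == "__main__":
    gb18030_pairs = [
        (0xB0A1, 0x554A),      # inside the GB-2312 hanzi rows
        (0x8140, 0x4E02),      # GBK extension area, filtered out
        (0x81308130, 0x0080),  # four-byte sequence, filtered out
    ]
    for code, ucs in filter_gb2312(gb18030_pairs):
        print(f"0x{code:04x} -> U+{ucs:04X}")
```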

This now forms a reasonable basis for switching to radix tree. Every 
mapping table is now generated by the print_tables() perl function in 
convutils.pm. To switch to a radix tree, you just need to swap that 
function with one that produces a radix tree instead of the 
current-format mapping tables.
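
To make the "swap the output function" idea concrete, here is a small
Python sketch (not the Perl in convutils.pm, and not the radix tree
format from the proposed patch). Both builders take the same list of
(code, ucs) pairs, so the caller does not care which one is used. All
names are hypothetical, and the tree is a plain nested dict keyed by
byte, assuming no code is a byte-wise prefix of another.

```python
from bisect import bisect_left


def build_flat(pairs):
    """Current-style table: pairs sorted by code, searched by bisection."""
    return sorted(pairs)


def lookup_flat(table, code):
    i = bisect_left(table, (code,))
    if i < len(table) and table[i][0] == code:
        return table[i][1]
    return None


def code_bytes(code):
    """Split an integer code into its big-endian bytes (at least one)."""
    n = max(1, (code.bit_length() + 7) // 8)
    return code.to_bytes(n, "big")


def build_radix(pairs):
    """Radix-style table: one dict level per input byte."""
    root = {}
    for code, ucs in pairs:
        *prefix, last = code_bytes(code)
        node = root
        for b in prefix:
            node = node.setdefault(b, {})
        node[last] = ucs
    return root


def lookup_radix(root, code):
    node = root
    for b in code_bytes(code):
        if not isinstance(node, dict) or b not in node:
            return None
        node = node[b]
    # A dict here means the code was only a prefix of longer codes.
    return None if isinstance(node, dict) else node


if __name__ == "__main__":
    pairs = [(0x41, 0x41), (0xB0A1, 0xAC00), (0x8141, 0xAC02)]
    flat, radix = build_flat(pairs), build_radix(pairs)
    for code in (0x41, 0xB0A1, 0xB0A2):
        print(hex(code), lookup_flat(flat, code), lookup_radix(radix, code))
```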

The perl scripts are still quite messy. For example, I lost the checks 
for duplicate mappings somewhere along the way - that ought to be put 
back. My Perl skills are limited.
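
For reference, a duplicate check could look roughly like the following
Python sketch. It is not the check that was lost from the Perl scripts.
It assumes mappings are available as (code, ucs) pairs and reports any
code that maps to more than one code point, and any code point that
maps to more than one code. The function name is hypothetical.

```python
from collections import defaultdict


def find_duplicates(pairs):
    """Return (dup_codes, dup_ucs) dicts for conflicting mappings."""
    by_code = defaultdict(set)
    by_ucs = defaultdict(set)
    for code, ucs in pairs:
        by_code[code].add(ucs)
        by_ucs[ucs].add(code)
    # Only keys with more than one distinct partner are conflicts;
    # an exact repeat of the same pair is not reported.
    dup_codes = {c: sorted(u) for c, u in by_code.items() if len(u) > 1}
    dup_ucs = {u: sorted(c) for u, c in by_ucs.items() if len(c) > 1}
    return dup_codes, dup_ucs


if __name__ == "__main__":
    # Made-up pairs, chosen only to trigger both kinds of conflict.
    pairs = [(0xA1A1, 0x3000), (0xA1A1, 0x3001), (0xA1A2, 0x3001)]
    dup_codes, dup_ucs = find_duplicates(pairs)
    print("codes with several targets:", dup_codes)
    print("code points with several sources:", dup_ucs)
```

Whether a duplicate in one direction is an error depends on the
conversion direction the table is generated for.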


This is now an orthogonal discussion, and doesn't need to block the 
radix tree work, but we should consider what we want to base our mapping 
tables on. Perhaps we could use the XML files from ICU as the source for 
all of the mappings?

ICU seems to use a BSD-like license, so we could even include the XML 
files in our repository. Actually, looking at 
http://www.unicode.org/copyright.html#License, I think we could include 
the *.TXT files in our repository, too, if we wanted to. The *.TXT files 
are found under www.unicode.org/Public/, so that license applies. I 
think that has changed somewhat recently, because the comments in our 
perl scripts claim that the license didn't allow that.

- Heikki


Attachment: 0001-Remove-code-points-0x80-from-character-conversion-ta.patch.bz2
Description: application/x-bzip (3.7 KB)
Attachment: 0002-Remove-unnecessary-leading-zeros.patch.bz2
Description: application/x-bzip (616.1 KB)
Attachment: 0003-Rewrite-the-perl-scripts-to-produce-our-Unicode-conv.patch.bz2
Description: application/x-bzip (12.8 KB)
