Re: Radix tree for character conversion

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: robertmhaas(at)gmail(dot)com, tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, ishii(at)sraoss(dot)co(dot)jp, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Radix tree for character conversion
Date: 2016-10-25 09:23:48
Message-ID: 08e7892a-d55c-eefe-76e6-7910bc8dd1f3@iki.fi
Lists: pgsql-hackers

On 10/21/2016 11:33 AM, Kyotaro HORIGUCHI wrote:
> Hello, this is a new version of radix charconv.
>
> At Sat, 8 Oct 2016 00:37:28 +0300, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote in <6d85d710-9554-a928-29ff-b2d3b80b01c9(at)iki(dot)fi>
>> What I don't want is that the current *.map files are turned into the
>> authoritative source files, that we modify by hand. There are no
>> comments in them, for starters, which makes hand-editing
>> cumbersome. It seems that we have edited some of them by hand already,
>> but we should rectify that.
>
> Agreed. So, I identified the source files of each character for the EUC_JP
> and SJIS conversions to clarify what has been done to them.
>
> SJIS conversion is made from CP932.TXT and 8 additional
> conversions for UTF8->SJIS and none for SJIS->UTF8.
>
> EUC_JP is made from CP932.TXT and JIS0212.TXT. JIS0201.TXT and
> JIS0208.TXT are useless. It adds 83 or 86 (different by
> direction) conversion entries.
>
> http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
> http://unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0212.TXT
>
> Now the generator scripts no longer use *.map as their source; instead
> they generate the old-style map files as well as the radix tree files.
>
> For convenience, UCS_to_(SJIS|EUC_JP).pl takes the parameters --flat and
> -v. The former generates the old-style flat map as well as the radix
> map file, and adding -v adds a source description for each line
> in the flat map file.
>
> While working on this, I noticed that the EUC_JP map lacks some
> conversions, but that is another issue.

Thanks!

I'd really like to clean up all the current perl scripts before we
start to do the radix tree stuff. I worked through the rest of the
conversions, and fixed/hacked the perl scripts so that they faithfully
reproduce the mapping tables that we have in the repository currently.
Whether those are the best mappings or not, or whether we should update
them based on some authoritative source is another question, but let's
try to nail down the process of creating the mapping tables.

Tom Lane looked into this in Nov 2015
(https://www.postgresql.org/message-id/28825.1449076551%40sss.pgh.pa.us).
This is a continuation of that, to actually fix the scripts. This patch
series doesn't change any of the mappings, only the way we produce the
mapping tables.

Our UHC conversion tables contained a lot more characters than the
CP949.TXT file they're supposedly based on. I rewrote the script to use
the "windows-949-2000.xml" file, from the ICU project, as the source
instead. It's a much closer match to our mapping tables, containing all
but one of the additional characters. We were already using
gb-18030-2000.xml as the source in UCS_GB18030.pl, so parsing ICU's XML
files isn't a new thing.

The GB2312.TXT source file seems to have disappeared from the Unicode
consortium's FTP site. I changed the UCS_to_EUC_CN.pl script to use
gb-18030-2000.xml as the source instead. GB-18030 is an extension of
GB-2312, so UCS_to_EUC_CN.pl filters out the additional characters that
are not in GB-2312.
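
To illustrate the filtering step, here is a minimal sketch in Python (used
only for illustration; the actual scripts are Perl). It assumes the ICU XML
file lists mappings as <a u="..." b="..."/> elements, and it assumes the
GB-2312 subset can be recognized as two-byte codes with a lead byte in
0xA1-0xF7 and a trail byte in 0xA1-0xFE. The sample data and the function
names are hypothetical, and the real script's filter may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt in the style of an ICU character-map XML file.
SAMPLE_XML = """<characterMapping>
  <assignments>
    <a u="0041" b="41"/>
    <a u="0080" b="81 30 81 30"/>
    <a u="00A4" b="A1 E8"/>
    <a u="4E02" b="81 40"/>
    <a u="554A" b="B0 A1"/>
  </assignments>
</characterMapping>"""


def in_gb2312_range(code_bytes):
    """Assumed filter: two-byte code, lead 0xA1-0xF7, trail 0xA1-0xFE."""
    if len(code_bytes) != 2:
        return False
    lead, trail = code_bytes
    return 0xA1 <= lead <= 0xF7 and 0xA1 <= trail <= 0xFE


def load_euc_cn_subset(xml_text):
    """Return {unicode code point: EUC_CN code} for entries passing the filter."""
    root = ET.fromstring(xml_text)
    mapping = {}
    for elem in root.iter("a"):
        ucs = int(elem.get("u"), 16)
        code_bytes = bytes(int(b, 16) for b in elem.get("b").split())
        if in_gb2312_range(code_bytes):
            mapping[ucs] = int.from_bytes(code_bytes, "big")
    return mapping


subset = load_euc_cn_subset(SAMPLE_XML)
for ucs, code in sorted(subset.items()):
    print(f"U+{ucs:04X} -> 0x{code:04X}")
```

With the sample above, the single-byte, four-byte, and 0x81-lead entries
are dropped, and only the two entries inside the assumed range remain.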

This now forms a reasonable basis for switching to radix tree. Every
mapping table is now generated by the print_tables() perl function in
convutils.pm. To switch to a radix tree, you just need to swap that
function with one that produces a radix tree instead of the
current-format mapping tables.
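
As a rough picture of what such a replacement could build, the sketch below
(Python, for illustration only) turns a {code point: target code} mapping
into a byte-wise radix tree keyed on the UTF-8 bytes of each character.
The nested-dict layout and the names build_radix_tree and lookup are
assumptions made for this example; they are not the format proposed in the
patch, which would be emitted as C tables.

```python
def build_radix_tree(mapping):
    """Build a nested dict keyed by successive UTF-8 bytes.

    mapping: dict of Unicode code point -> code in the target encoding.
    Inner nodes are dicts; leaves are the integer target codes.
    """
    root = {}
    for ucs, code in mapping.items():
        key = chr(ucs).encode("utf-8")
        node = root
        for byte in key[:-1]:
            node = node.setdefault(byte, {})
        # UTF-8 is prefix-free, so a leaf never collides with an inner node.
        node[key[-1]] = code
    return root


def lookup(tree, utf8_bytes):
    """Walk the tree one byte at a time; return the code or None."""
    node = tree
    for byte in utf8_bytes:
        if not isinstance(node, dict) or byte not in node:
            return None
        node = node[byte]
    return node if isinstance(node, int) else None


tree = build_radix_tree({0x00A4: 0xA1E8, 0x554A: 0xB0A1})
print(hex(lookup(tree, "\u554a".encode("utf-8"))))
```

Each lookup costs one step per input byte, instead of a binary search over
the whole flat table.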

The perl scripts are still quite messy. For example, I lost the checks
for duplicate mappings somewhere along the way - that ought to be put
back. My Perl skills are limited.
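
For reference, a duplicate check of the kind mentioned above could look
roughly like this. It is a Python sketch for illustration only, not the
check that was in the Perl scripts; the (ucs, code) pair format and the
function name find_duplicates are assumptions.

```python
from collections import defaultdict


def find_duplicates(entries):
    """Report keys that map to more than one value, in both directions.

    entries: iterable of (ucs, code) pairs read from a source file.
    Returns (dup_ucs, dup_code): code points with several target codes,
    and target codes with several code points.
    """
    by_ucs = defaultdict(set)
    by_code = defaultdict(set)
    for ucs, code in entries:
        by_ucs[ucs].add(code)
        by_code[code].add(ucs)
    dup_ucs = {u: sorted(c) for u, c in by_ucs.items() if len(c) > 1}
    dup_code = {c: sorted(u) for c, u in by_code.items() if len(u) > 1}
    return dup_ucs, dup_code


# Hypothetical entries: U+00A4 appears with two different target codes.
sample = [(0x00A4, 0xA1E8), (0x554A, 0xB0A1), (0x00A4, 0xA1E9)]
dup_ucs, dup_code = find_duplicates(sample)
for ucs, codes in dup_ucs.items():
    print(f"U+{ucs:04X} maps to", ", ".join(f"0x{c:04X}" for c in codes))
```

Whether a duplicate in a given direction is an error depends on the table:
a code point with two target codes is ambiguous for UTF8->encoding, while
two code points sharing a code only matters for the reverse direction.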

This is now an orthogonal discussion, and doesn't need to block the
radix tree work, but we should consider what we want to base our mapping
tables on. Perhaps we could use the XML files from ICU as the source for
all of the mappings?

ICU seems to use a BSD-like license, so we could even include the XML
files in our repository. Actually, looking at
http://www.unicode.org/copyright.html#License, I think we could include
the *.TXT files in our repository, too, if we wanted to. The *.TXT files
are found under www.unicode.org/Public/, so that license applies. I
think that has changed somewhat recently, because the comments in our
perl scripts claim that the license didn't allow that.

- Heikki

Attachment Content-Type Size
0001-Remove-code-points-0x80-from-character-conversion-ta.patch.bz2 application/x-bzip 3.7 KB
0002-Remove-unnecessary-leading-zeros.patch.bz2 application/x-bzip 616.1 KB
0003-Rewrite-the-perl-scripts-to-produce-our-Unicode-conv.patch.bz2 application/x-bzip 12.8 KB
