Re: Radix tree for character conversion

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: hlinnaka(at)iki(dot)fi
Cc: robertmhaas(at)gmail(dot)com, tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, ishii(at)sraoss(dot)co(dot)jp, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Radix tree for character conversion
Date: 2016-10-27 07:23:37
Message-ID: 20161027.162337.204351475.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-hackers

Hello, thank you very much for the work. It made my work much
easier.

At Tue, 25 Oct 2016 12:23:48 +0300, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote in <08e7892a-d55c-eefe-76e6-7910bc8dd1f3(at)iki(dot)fi>
> I'd really like to clean up all the current perl scripts, before we
> start to do the radix tree stuff. I worked through the rest of the
> conversions, and fixed/hacked the perl scripts so that they faithfully
> re-produce the mapping tables that we have in the repository
> currently. Whether those are the best mappings or not, or whether we
> should update them based on some authoritative source is another
> question, but let's try to nail down the process of creating the
> mapping tables.
>
> Tom Lane looked into this in Nov 2015
> (https://www.postgresql.org/message-id/28825.1449076551%40sss.pgh.pa.us). This
> is a continuation of that, to actually fix the scripts. This patch
> series doesn't change any of the mappings, only the way we produce the
> mapping tables.
>
> Our UHC conversion tables contained a lot more characters than the
> CP949.TXT file it's supposedly based on. I rewrote the script to use
> "windows-949-2000.xml" file, from the ICU project, as the source
> instead. It's a much closer match to our mapping tables, containing
> all but one of the additional characters. We were already using
> gb-18030-2000.xml as the source in UCS_GB18030.pl, so parsing ICU's
> XML files isn't a new thing.
>
> The GB2312.TXT source file seems to have disappeared from the Unicode
> consortium's FTP site. I changed the UCS_to_EUC_CN.pl script to use
> gb-18030-2000.xml as the source instead. GB-18030 is an extension of
> GB-2312, UCS_to_EUC_CN.pl filters out the additional characters that
> are not in GB-2312.
>
> This now forms a reasonable basis for switching to radix tree. Every
> mapping table is now generated by the print_tables() perl function in
> convutils.pm. To switch to a radix tree, you just need to swap that
> function with one that produces a radix tree instead of the
> current-format mapping tables.

RADIXCONV.pm is merged into convutils.pm, and the dereference
syntax is unified from $$x{} to $x->{}. (Subroutine calls with '&'
are not unified yet.) Radix tree files are now written by a
function with an interface similar to print_tables():

print_radix_trees($script_name, $encoding, \@mapping);

> The perl scripts are still quite messy. For example, I lost the checks
> for duplicate mappings somewhere along the way - that ought to be put
> back. My Perl skills are limited.

Perl scripts are meant to be messy, I believe. Anyway, the duplicate
check has been built into the sub print_radix_trees. Maybe the same
check is needed for some plain map files, but it would be mere
duplication for the maps that have a radix tree.
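
To illustrate what such a duplicate check looks for, here is a
minimal sketch. It is written in Python rather than Perl and is not
the code in print_radix_trees; the function name and the
(source, target) pair layout are assumptions made for this example.

```python
def find_duplicate_sources(mapping):
    """Return {source_code: [targets...]} for every source code that
    appears more than once in `mapping`.

    `mapping` is a list of (source_code, target_code) integer pairs.
    This layout is hypothetical; convutils.pm uses its own structure.
    """
    first_target = {}   # source code -> target seen first
    duplicates = {}     # source code -> all targets seen, in input order
    for src, dst in mapping:
        if src in first_target:
            # Second or later occurrence: record it alongside the first.
            duplicates.setdefault(src, [first_target[src]]).append(dst)
        else:
            first_target[src] = dst
    return duplicates
```

A source code listed twice with different targets is a real conflict,
while one listed twice with the same target is only redundant; both
show up in the result so either can be reported.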

The attached patches apply on top of your patches and change all
possible conversions to use radix trees (combined characters still
use the old method). In addition, because radix-tree data is
difficult to verify by eye, I added map_checker (make mapcheck) to
check the radix trees against the plain maps.
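
The idea of checking a radix tree against a plain map can be
sketched as follows. This is only an illustration in Python under
stated assumptions: the tree here is a nested dict with one level per
byte of the source code, which is not the layout the patch emits,
and all function names and sample values are hypothetical.

```python
def code_bytes(code):
    """Split an integer character code into big-endian bytes (at least one)."""
    length = max(1, (code.bit_length() + 7) // 8)
    return code.to_bytes(length, "big")


def build_radix_tree(plain_map):
    """Build a nested-dict tree keyed by successive bytes of the source code.

    Assumes no source code is a byte-prefix of another one; a clash
    raises ValueError instead of silently overwriting an entry.
    """
    root = {}
    for src, dst in plain_map.items():
        *prefix, last = code_bytes(src)
        node = root
        for b in prefix:
            child = node.setdefault(b, {})
            if not isinstance(child, dict):
                raise ValueError(f"prefix clash at 0x{src:X}")
            node = child
        if last in node:
            raise ValueError(f"prefix clash at 0x{src:X}")
        node[last] = dst
    return root


def radix_lookup(root, code):
    """Return the mapped value for `code`, or None if it is not in the tree."""
    node = root
    for b in code_bytes(code):
        if not isinstance(node, dict) or b not in node:
            return None
        node = node[b]
    # Stopping on an inner node means `code` is only a prefix, not a character.
    return None if isinstance(node, dict) else node


def check_against_plain_map(root, plain_map):
    """Return the source codes whose tree lookup disagrees with the plain map."""
    return [src for src, dst in plain_map.items()
            if radix_lookup(root, src) != dst]
```

An empty list from check_against_plain_map() means every entry of
the plain map is reproduced by the tree; it does not prove that the
tree contains no extra entries.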

I have briefly checked with real characters for
SJIS/EUC-JP/BIG5/ISO8859-13 and radix conversion seems to work
correctly for them.

> This is now an orthogonal discussion, and doesn't need to block the
> radix tree work, but we should consider what we want to base our
> mapping tables on. Perhaps we could use the XML files from ICU as the
> source for all of the mappings?
>
> ICU seems to use a BSD-like license, so we could even include the XML
> files in our repository. Actually, looking at
> http://www.unicode.org/copyright.html#License, I think we could
> include the *.TXT files in our repository, too, if we wanted to. The
> *.TXT files are found under www.unicode.org/Public/, so that license
> applies. I think that has changed somewhat recently, because the
> comments in our perl scripts claim that the license didn't allow that.

For convenience, all the required files can be downloaded by
running 'make download-texts'.

In the following document,

http://unicode.org/Public/ReadMe.txt

| Terms of Use
| http://www.unicode.org/copyright.html

http://www.unicode.org/copyright.html

| EXHIBIT 1
| UNICODE, INC. LICENSE AGREEMENT - DATA FILES AND SOFTWARE
| Unicode Data Files include all data files under the directories
| http://www.unicode.org/Public/, http://www.unicode.org/reports/,
...

| COPYRIGHT AND PERMISSION NOTICE
|
| Copyright (c) 1991-2016 Unicode, Inc. All rights reserved.
| Distributed under the Terms of Use in http://www.unicode.org/copyright.html.
|
| Permission is hereby granted, free of charge, to any person obtaining
| a copy of the Unicode data files and any associated documentation
| (the "Data Files") or Unicode software and any associated documentation
| (the "Software") to deal in the Data Files or Software
| without restriction, including without limitation the rights to use,
| copy, modify, merge, publish, distribute, and/or sell copies of
| the Data Files or Software, and to permit persons to whom the Data Files
| or Software are furnished to do so, provided that either
| (a) this copyright and permission notice appear with all copies
| of the Data Files or Software, or
| (b) this copyright and permission notice appear in associated
| Documentation.

Perhaps we can put the files into our repository, provided we include
the required notices.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
0004-Update-sjis-0213-2004-std.txt.patch.bz2 application/octet-stream 939 bytes
0005-Make-map-generators-to-generate-radix-tree-files.patch.bz2 application/octet-stream 15.3 KB
0006-Replace-map-files-with-radix-tree-files.patch.bz2 application/octet-stream 2.6 MB
