From: | Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> |
---|---|
To: | hlinnaka(at)iki(dot)fi |
Cc: | robertmhaas(at)gmail(dot)com, tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, ishii(at)sraoss(dot)co(dot)jp, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Radix tree for character conversion |
Date: | 2016-10-21 08:33:21 |
Message-ID: | 20161021.173321.105120238.horiguchi.kyotaro@lab.ntt.co.jp |
Lists: | pgsql-hackers |
Hello, this is a new version of the radix charconv patch.
At Sat, 8 Oct 2016 00:37:28 +0300, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote in <6d85d710-9554-a928-29ff-b2d3b80b01c9(at)iki(dot)fi>
> What I don't want is that the current *.map files are turned into the
> authoritative source files, that we modify by hand. There are no
> comments in them, for starters, which makes hand-editing
> cumbersome. It seems that we have edited some of them by hand already,
> but we should rectify that.
Agreed. So, I identified the source file of each character for the
EUC_JP and SJIS conversions to clarify what has been done to them.
The SJIS conversion is made from CP932.TXT plus 8 additional
conversions for UTF8->SJIS and none for SJIS->UTF8.
The EUC_JP conversion is made from CP932.TXT and JIS0212.TXT;
JIS0201.TXT and JIS0208.TXT are not needed. In addition, 83 or 86
conversion entries (depending on the direction) are added.
http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
http://unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0212.TXT
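
As a rough illustration of reading such mapping files (this is not the code in the attached patches, which use Perl), the following Python sketch parses lines of the form "0xXXXX 0xXXXX # comment". The function name `parse_mapping` and the sample lines are made up for the example, and the assumption that entries without a Unicode value simply lack the second hex column should be checked against the real files.

```python
import re

# Matches "0x<local> <ws> 0x<ucs> [# comment]". Lines whose Unicode
# column is blank (e.g. undefined bytes) do not match and are skipped.
_LINE = re.compile(r"^0x([0-9A-Fa-f]+)\s+0x([0-9A-Fa-f]+)\s*(?:#\s*(.*))?$")


def parse_mapping(lines):
    """Yield (local_code, ucs, comment) tuples from mapping-file lines."""
    for line in lines:
        m = _LINE.match(line.strip())
        if m is None:
            continue  # header comment, blank line, or entry with no mapping
        yield int(m.group(1), 16), int(m.group(2), 16), (m.group(3) or "")


# Hypothetical sample in the style of the Unicode consortium mapping files.
sample = [
    "# comment header",
    "0x80\t      \t#UNDEFINED",
    "0x8140\t0x3000\t#IDEOGRAPHIC SPACE",
    "0x41\t0x0041\t#LATIN CAPITAL LETTER A",
]
entries = list(parse_mapping(sample))
```

Keeping the trailing comment alongside each pair makes it possible to carry a human-readable description through to the generated output.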
Now the generator scripts no longer use *.map as their source; instead
they generate the old-style map files as well as the radix tree files.
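
To make the difference between the two output forms concrete, here is a minimal Python sketch, not the layout used in the patches. It assumes the flat map is a sorted list of (UTF-8 bytes packed into an integer, local code) pairs and models the radix tree as nested dictionaries indexed by one UTF-8 byte per level. The names `build_radix`, `lookup` and `flat_map` are hypothetical.

```python
def build_radix(pairs):
    """Build a nested-dict radix tree keyed on the UTF-8 bytes of each code point."""
    root = {}
    for ucs, code in pairs:
        key = chr(ucs).encode("utf-8")
        node = root
        for b in key[:-1]:
            node = node.setdefault(b, {})  # descend, creating inner nodes
        node[key[-1]] = code               # leaf holds the local code
    return root


def lookup(root, utf8_bytes):
    """Walk the tree one byte at a time; return the local code or None."""
    node = root
    for b in utf8_bytes:
        if not isinstance(node, dict) or b not in node:
            return None
        node = node[b]
    return node if isinstance(node, int) else None


def flat_map(pairs):
    """Flat form: entries sorted by the UTF-8 sequence packed as an integer."""
    return sorted(
        (int.from_bytes(chr(u).encode("utf-8"), "big"), c) for u, c in pairs
    )


# (Unicode code point, SJIS code) pairs chosen only for illustration.
pairs = [(0x3000, 0x8140), (0x3001, 0x8141), (0x41, 0x41)]
tree = build_radix(pairs)
flat = flat_map(pairs)
```

A lookup in the flat form needs a binary search over all entries, while the tree form needs one indexing step per input byte.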
For convenience, UCS_to_(SJIS|EUC_JP).pl takes the parameters --flat
and -v. The former generates the old-style flat map file as well as
the radix map file, and adding -v annotates each line in the flat map
file with a description of its source.
While working on this, I noticed that the EUC_JP map lacks some
conversions, but that is a separate issue.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachment | Content-Type | Size |
---|---|---|
0001-Radix-tree-infrastructure-for-character-encoding.patch | text/x-patch | 26.9 KB |
0002-Use-radix-tree-for-UTF8-ShiftJIS-conversion.patch | text/x-patch | 369.3 KB |
0003-Use-radix-tree-for-UTF8-EUC_JP-conversion.patch | text/x-patch | 593.4 KB |