Radix tree for character conversion

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: hlinnaka(at)iki(dot)fi
Cc: tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, ishii(at)sraoss(dot)co(dot)jp, pgsql-hackers(at)postgresql(dot)org
Subject: Radix tree for character conversion
Date: 2016-10-07 08:36:06
Message-ID: 20161007.173606.217452136.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

This is a differnet topic from the original thread so I renamed
the subject and repost. Sorry for duplicate posting.

======================
Hello, I did this.

As a result, radix tree is about 1.5 times faster and needs a
half memory.

At Wed, 21 Sep 2016 15:14:27 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote in <20160921(dot)151427(dot)265121484(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> I'll work on this for the next CF.

The radix conversion function and map conversion script became
more generic than the previous state. So I could easily added
radix conversion of EUC_JP in addition to SjiftJIS.

nm -S said that the size of radix tree data for sjis->utf8
conversion is 34kB and that for utf8->sjis is 46kB. (eucjp->utf8
57kB, utf8->eucjp 93kB) LUmapSJIS and ULmapSJIS was 62kB and
59kB, and LUmapEUC_JP and ULmapEUC_JP was 106kB and 105kB. If I'm
not missing something, radix tree is faster and require less
memory.

A simple test where 'select '7070 sjis chars' x 100' (I'm not
sure, but the size is 1404kB) on local connection shows that this
is fast enough.

radix: real 0m0.285s / user 0m0.199s / sys 0m0.006s
master: real 0m0.418s / user 0m0.180s / sys 0m0.004s

To make sure, the result of a test of sending the same amount of
ASCII string (1404kB) on SJIS and UTF8(no-conversion) encoding is
as follows.

ascii/utf8-sjis: real 0m0.220s / user 0m0.176s / sys 0m0.011s
ascii/utf8-utf8: real 0m0.137s / user 0m0.111s / sys 0m0.008s

======
Random discussions -

Currently the tree structure is devided into several elements,
One for 2-byte, other ones for 3-byte and 4-byte codes and output
table. The other than the last one is logically and technically
merged into single table but it makes the generator script far
complex than the current complexity. I no longer want to play
hide'n seek with complex perl object..

It might be better that combining this as a native feature of the
core. Currently the helper function is in core but that function
is given as conv_func on calling LocalToUtf.

Current implement uses *.map files of pg_utf_to_local as
input. It seems not good but the radix tree files is completely
uneditable. Provide custom made loading functions for every
source instead of load_chartable() would be the way to go.

# However, for example utf8_to_sjis.map, it doesn't seem to have
# generated from the source mentioned in UCS_to_SJIS.pl

I'm not sure that compilers other than gcc accepts generated map
file content.

The RADIXTREE.pm is in rather older style but seem no problem.

I haven't tried this for charsets that contains 4-byte code.

I haven't consider charset with conbined characters. I don't
think it is needed so immediately.

Though I believe that this is easily applied to other
conversions, I tried this only with character sets that I know
about it.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
0001-Radix-tree-infrastructure-for-character-encoding.patch text/x-patch 23.9 KB
0002-Use-radix-tree-for-UTF8-ShiftJIS-conversion.patch.gz application/octet-stream 65.7 KB
0003-Use-radix-tree-for-UTF8-EUC_JP-conversion.patch.gz application/octet-stream 102.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2016-10-07 08:49:50 Re: VACUUM's ancillary tasks
Previous Message Rajkumar Raghuwanshi 2016-10-07 08:33:42 Re: Declarative partitioning - another take