From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> |
Cc: | tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, ishii(at)sraoss(dot)co(dot)jp, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Radix tree for character conversion |
Date: | 2016-10-07 10:46:31 |
Message-ID: | af224134-80dc-b18e-54f8-d45504754fc0@iki.fi |
Lists: | pgsql-hackers |
On 10/07/2016 11:36 AM, Kyotaro HORIGUCHI wrote:
> The radix conversion function and map conversion script have
> become more generic than before, so I could easily add radix
> conversion for EUC_JP in addition to ShiftJIS.
>
> nm -S says that the size of the radix tree data for sjis->utf8
> conversion is 34kB and that for utf8->sjis is 46kB (eucjp->utf8
> 57kB, utf8->eucjp 93kB). LUmapSJIS and ULmapSJIS were 62kB and
> 59kB, and LUmapEUC_JP and ULmapEUC_JP were 106kB and 105kB. If I'm
> not missing something, the radix tree is faster and requires less
> memory.
Cool!
> Currently the tree structure is divided into several elements:
> one for 2-byte codes, others for 3-byte and 4-byte codes, and an
> output table. All but the last could logically and technically be
> merged into a single table, but that would make the generator
> script far more complex than it already is. I no longer want to
> play hide-and-seek with complex Perl objects.
I think that's OK. There isn't really anything to gain by merging them.
> It might be better to make this a native feature of the
> core. Currently the helper function is in core, but it is
> passed as conv_func when calling LocalToUtf.
Yeah, I think we want to completely replace the current binary-search
based code with this. I would rather maintain just one mechanism.
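
To illustrate the difference between the two mechanisms, here is a minimal sketch. It is a toy example, not code from the patch: the names map_pair, bsearch_lookup, radix_lookup and check_lookups_agree, the four-entry table, and the two-level layout are all assumptions made for illustration. It covers only 2-byte codes and uses 0 to mean "no mapping".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical toy mapping: 2-byte local code -> Unicode code point. */
typedef struct
{
    uint16_t code;
    uint32_t ucs;
} map_pair;

/* Binary-search style: one flat array, sorted by source code. */
static const map_pair pairs[] = {
    {0x8140, 0x3000},
    {0x8141, 0x3001},
    {0x8142, 0x3002},
    {0x889F, 0x4E9C},
};

static int
cmp_pair(const void *key, const void *elem)
{
    uint16_t k = *(const uint16_t *) key;
    uint16_t c = ((const map_pair *) elem)->code;

    return (k > c) - (k < c);
}

/* O(log n) comparisons per lookup; returns 0 if the code is unmapped. */
uint32_t
bsearch_lookup(uint16_t code)
{
    const map_pair *p = bsearch(&code, pairs,
                                sizeof(pairs) / sizeof(pairs[0]),
                                sizeof(pairs[0]), cmp_pair);

    return p ? p->ucs : 0;
}

/*
 * Radix style: the first byte selects a leaf, the second byte indexes
 * into it. Leaf 0 is all zeros and is shared by every unmapped first
 * byte, so the lookup needs no comparisons at all.
 */
static const uint8_t level1[256] = {[0x81] = 1, [0x88] = 2};

static const uint32_t leaves[3][256] = {
    {0},
    {[0x40] = 0x3000, [0x41] = 0x3001, [0x42] = 0x3002},
    {[0x9F] = 0x4E9C},
};

/* Two array indexings per lookup; returns 0 if the code is unmapped. */
uint32_t
radix_lookup(uint16_t code)
{
    return leaves[level1[code >> 8]][code & 0xFF];
}

/* Sanity check: both mechanisms must agree on every 2-byte input. */
void
check_lookups_agree(void)
{
    for (uint32_t c = 0; c <= 0xFFFF; c++)
        assert(bsearch_lookup((uint16_t) c) == radix_lookup((uint16_t) c));
}
```

In this toy layout the radix tables are larger than the pair array because only four codes are mapped; the size advantage reported above would come from real maps, where many codes share a first byte and so share a leaf.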
> The current implementation uses the *.map files of pg_utf_to_local
> as input. That doesn't seem good, but the radix tree files are
> completely uneditable. Providing custom-made loading functions for
> every source instead of load_chartable() would be the way to go.
>
> # However, utf8_to_sjis.map, for example, doesn't seem to have
> # been generated from the source mentioned in UCS_to_SJIS.pl
Ouch. We should find and document an authoritative source for all the
mappings we have...
I think the next steps here are:
1. Find an authoritative source for all the existing mappings.
2. Generate the radix tree files directly from the authoritative
sources, instead of the existing *.map files.
3. Completely replace the existing binary-search code with this.
- Heikki