Re: Supporting SJIS as a database encoding

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: hlinnaka(at)iki(dot)fi
Cc: tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, ishii(at)sraoss(dot)co(dot)jp, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Supporting SJIS as a database encoding
Date: 2016-09-21 06:14:27
Message-ID: 20160921.151427.265121484.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-hackers

Hello,

At Tue, 13 Sep 2016 11:44:01 +0300, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote in <7ff67a45-a53e-4d38-e25d-3a121afea47c(at)iki(dot)fi>
> On 09/08/2016 09:35 AM, Kyotaro HORIGUCHI wrote:
> > Returning in UTF-8 bloats the result string by about 1.5 times so
> > it doesn't seem to make sense comparing with it. But it takes
> > real = 47.35s.
>
> Nice!

Thanks!

> I was hoping that this would also make the binaries smaller. A few
> dozen kB of storage is perhaps not a big deal these days, but
> still. And smaller tables would also consume less memory and CPU
> cache.

Agreed.

> I removed the #include "../../Unicode/utf8_to_sjis.map" line, so that
> the old table isn't included anymore, compiled, and ran "strip
> utf8_and_sjis.so". Without this patch, it's 126 kB, and with it, it's
> 160 kB. So the radix tree takes a little bit more space.
>
> That's not too bad, and I'm sure we could live with that, but with a
> few simple tricks, we could do better. First, since all the values we
> store in the tree are < 0xffff, we could store them in int16 instead
> of int32, and halve the size of the table right off the bat. That won't
> work for all encodings, of course, but it might be worth it to
> have two versions of the code, one for int16 and another for int32.

That's right. I used int carelessly. All of the characters in the
patch, and most characters in encodings other than the
Unicode-related ones, fit in 2 bytes. 3-byte characters can go in
a separate table in the struct for that case. Otherwise, two or
more versions of the struct are possible, since currently the
radix struct is utf8_and_sjis's own, in spite of the fact that it
is in pg_wchar.h in the patch.
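
As a rough illustration of the "separate table" idea (a sketch only,
not code from the patch; the names conv_table, chars16, chars32 and
conv_lookup are hypothetical), a struct can carry a 16-bit table and
an optional 32-bit one. Unsigned 16-bit storage is assumed here
because values such as 0x817e do not fit in a signed int16.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical container: chars16 is used when every target code fits
 * in 16 bits, chars32 otherwise.  Exactly one of them is non-NULL.
 */
typedef struct
{
    const uint16_t *chars16;    /* narrow leaf values, or NULL */
    const uint32_t *chars32;    /* wide leaf values, or NULL */
} conv_table;

/* Fetch the value at idx from whichever table is present. */
static uint32_t
conv_lookup(const conv_table *t, size_t idx)
{
    if (t->chars16 != NULL)
        return t->chars16[idx];
    return t->chars32[idx];
}

/* Return 1 if all n values fit in 16 bits, so the narrow table suffices. */
static int
fits_in_16(const uint32_t *values, size_t n)
{
    size_t      i;

    for (i = 0; i < n; i++)
    {
        if (values[i] > 0xffff)
            return 0;
    }
    return 1;
}
```

With the same number of entries, the uint16_t table occupies half the
bytes of the uint32_t one, which is the saving discussed above.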

> Another trick is to eliminate redundancies in the tables. Many of the
> tables contain lots of zeros, as in:
>
> > /* c3xx */{
...
> > 0x817e,
> > /* c398 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
> > 0x0000,
> > /* c3a0 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
> > 0x0000,
> > /* c3a8 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
> > 0x0000,
> > /* c3b0 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
> > 0x8180,
> > /* c3b8 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
> > 0x0000
> > },
>
> and
>
> > /* e388xx */{
> > /* e38880 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
> > 0x0000,
> > /* e38888 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
> > 0x0000,
> > /* e38890 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
> > 0x0000,
> > /* e38898 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
> > 0x0000,
> > /* e388a0 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
> > 0x0000,
...
> > },
>
> You could overlay the last row of the first table, which is all zeros,
> with the first row of the second table, which is also all zeros. (Many
> of the tables have a lot more zero-rows than this example.)

Yes, the bunch of zeros was an annoyance. Several compression
techniques are available in exchange for some additional CPU
time, but the technique you suggested doesn't need such a
sacrifice, which sounds nice.
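
The overlay can be sketched as follows, assuming all leaf tables are
packed into one flat array and referenced by offset. The function
name pool_append is hypothetical, and overlapping at the granularity
of single entries rather than whole rows is a choice made for this
sketch, not something taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Append the n entries of src to the pool dst, which currently holds
 * *used entries.  Leading zeros of src are overlaid on trailing zeros
 * already in the pool, so shared zero runs are stored only once.
 * Returns the offset at which src starts and updates *used.
 * The caller must make sure dst has room for *used + n entries.
 */
static size_t
pool_append(uint16_t *dst, size_t *used, const uint16_t *src, size_t n)
{
    size_t      trailing = 0;
    size_t      leading = 0;
    size_t      overlap;
    size_t      start;
    size_t      i;

    /* Count zeros at the end of what is already stored. */
    while (trailing < *used && dst[*used - 1 - trailing] == 0)
        trailing++;

    /* Count zeros at the beginning of the new table. */
    while (leading < n && src[leading] == 0)
        leading++;

    overlap = (trailing < leading) ? trailing : leading;
    start = *used - overlap;

    /* The first 'overlap' entries are already zero in the pool. */
    for (i = overlap; i < n; i++)
        dst[start + i] = src[i];

    *used = start + n;
    return start;
}
```

A lookup then reads dst[offset + index] exactly as it would for
non-overlapping tables, so no extra CPU work is needed at conversion
time.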

> But yes, this patch looks very promising in general. I think we should
> switch over to radix trees for all the encodings.

The result was better than I expected for a character set with
about 7000 characters. We can expect a certain amount of
advantage even for character sets that have fewer than a hundred
characters.

I'll work on this for the next CF.

Thanks.

--
Kyotaro Horiguchi
NTT Open Source Software Center
