Re: Supporting SJIS as a database encoding

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, ishii(at)sraoss(dot)co(dot)jp, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Supporting SJIS as a database encoding
Date: 2016-09-13 08:44:01
Message-ID: 7ff67a45-a53e-4d38-e25d-3a121afea47c@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 09/08/2016 09:35 AM, Kyotaro HORIGUCHI wrote:
> Returning in UTF-8 bloats the result string by about 1.5 times so
> it doesn't seem to make sense comparing with it. But it takes
> real = 47.35s.

Nice!

I was hoping that this would also make the binaries smaller. A few dozen
kB of storage is perhaps not a big deal these days, but still. And
smaller tables would also consume less memory and CPU cache.

I removed the #include "../../Unicode/utf8_to_sjis.map" line, so that
the old table isn't included anymore, compiled, and ran "strip
utf8_and_sjis.so". Without this patch, it's 126 kB, and with it, it's
160 kB. So the radix tree takes a little bit more space.

That's not too bad, and I'm sure we could live with that, but with a few
simple tricks, we could do better. First, since all the values we store
in the tree are < 0xffff, we could store them in int16 instead of int32,
and halve the size of the table right off the bat. That won't work for
all encodings, of course, but it might be worth it to have two versions
of the code, one for int16 and another for int32.

Another trick is to eliminate redundancies in the tables. Many of the
tables contain lots of zeros, as in:

> /* c3xx */{
> /* c380 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
> /* c388 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
> /* c390 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x817e,
> /* c398 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
> /* c3a0 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
> /* c3a8 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
> /* c3b0 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x8180,
> /* c3b8 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000
> },

and

> /* e388xx */{
> /* e38880 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
> /* e38888 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
> /* e38890 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
> /* e38898 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
> /* e388a0 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
> /* e388a8 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
> /* e388b0 */ 0x0000, 0xfa58, 0x878b, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
> /* e388b8 */ 0x0000, 0x878c, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000
> },

You could overlay the last row of the first table, which is all zeros,
with the first row of the second table, which is also all zeros. (Many
of the tables have a lot more zero-rows than this example.)

But yes, this patch looks very promising in general. I think we should
switch over to radix trees for the all the encodings.

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kuntal Ghosh 2016-09-13 09:07:09 Re: WAL consistency check facility
Previous Message Fabien COELHO 2016-09-13 08:41:28 Re: pgbench - allow to store select results into variables