Re: Radix tree for character conversion

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: hlinnaka(at)iki(dot)fi
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, michael(dot)paquier(at)gmail(dot)com, daniel(at)yesql(dot)se, peter(dot)eisentraut(at)2ndquadrant(dot)com, robertmhaas(at)gmail(dot)com, tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com, ishii(at)sraoss(dot)co(dot)jp, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Radix tree for character conversion
Date: 2017-03-27 10:05:43
Lists: pgsql-hackers

Hmm, things are a bit different.

At Thu, 23 Mar 2017 12:13:07 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote in <20170323(dot)121307(dot)241436413(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> > Ok, I'll write a small script to generate a set of "conversion
> > dump" and try to write README.sanity_check describing how to use
> > it.
> I found that there's no way to identify the character domain of a
> conversion through the SQL interface. Unconditionally feeding
> everything from 0 to 0xffffffff as bytea strings yields a bloated
> result containing many bogus lines. (If \x40 is a character,
> convert() also accepts \x4040, \x404040 and \x40404040.)
> One more annoyance is the fact that mappings and conversion
> procedures are not in one-to-one correspondence. The
> correspondence is hidden in the conversion_procs/*.c files, so we
> would have to either extract it from them or provide it as
> built-in knowledge. Neither seems good.
> Finally, it seems that I have no choice but to resurrect
> map_checker. Exactly the same one no longer works, but
> map_dumper.c with almost the same structure will work.
> If no one objects to adding map_dumper.c and
> (tentative name, of course), I'll make a
> patch to do that.

The script or executable should be compatible between versions,
but pg_mb_radix_conv is not. On the other hand, a higher-level
API requires server stuff.

In the end I made an extension that dumps encoding conversions.

encoding_dumper('SJIS', 'UTF-8') or encoding_dumper(35, 6)

It returns output like the following, consisting of two BYTEA columns.

 srccode | dstcode
---------+----------
 \x01    | \x01
 \x02    | \x02
 ...
 \xfc4a  | \xe9b899
 \xfc4b  | \xe9bb91
(7914 rows)

This returns in a very short time, but not when srccode
extends to 4 bytes. As an extreme example, the following

> =# select * from encoding_dumper('UTF-8', 'LATIN1');

takes over 2 minutes to return only 255 rows. We cannot determine
the exact domain without looking into the map data, so the function
can do nothing but loop through all the four-byte values.
Providing a function that gives the domain for a conversion was a
mess, especially for arithmetic conversions. The following query
took 94 minutes to give 25M lines/125MB. In short, that's
crap. (the first attachment)

SELECT x.conname, y.srccode, y.dstcode
FROM
  (SELECT conname, conforencoding, contoencoding
   FROM pg_conversion c
   WHERE pg_char_to_encoding('UTF-8') IN (c.conforencoding, c.contoencoding)
     AND pg_char_to_encoding('SQL_ASCII')
         NOT IN (c.conforencoding, c.contoencoding)) as x,
  LATERAL
  (SELECT srccode, dstcode
   FROM encoding_dumper(x.conforencoding, x.contoencoding)) as y
ORDER BY x.conforencoding, x.contoencoding, y.srccode;
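
To illustrate why probing the byte space blindly is so expensive, here is a minimal sketch in Python. It is not the extension from the attached patch: it assumes Python's own "shift_jis" and "utf-8" codecs as stand-ins for the server's conversion procedures, and dump_conversion() and search_space() are hypothetical names. It tries every byte string up to a given length and drops inputs such as \x4040 that decode to more than one character, which are the bogus lines mentioned above.

```python
from itertools import product


def dump_conversion(src_codec, dst_codec, max_len):
    """Probe every byte string of 1..max_len bytes; keep single-character hits."""
    rows = []
    for length in range(1, max_len + 1):
        for combo in product(range(256), repeat=length):
            src = bytes(combo)
            try:
                text = src.decode(src_codec)
                # Inputs like b"\x40\x40" decode to several characters: bogus rows.
                if len(text) != 1:
                    continue
                rows.append((src, text.encode(dst_codec)))
            except UnicodeError:
                # Not a valid sequence in the source codec, or not encodable.
                continue
    return rows


def search_space(max_len):
    """Number of candidate byte strings of length 1..max_len."""
    return sum(256 ** n for n in range(1, max_len + 1))


# Assumption: Python's codecs, not PostgreSQL's .map files, define the mapping here.
rows = dump_conversion("shift_jis", "utf-8", 2)
mapping = dict(rows)
print(len(rows), "rows found among", search_space(2), "candidates")
print("candidates when srccode can be 4 bytes:", search_space(4))
```

With two-byte sources only 65792 candidates have to be tried, but allowing four bytes raises that to more than four billion, whatever the number of rows that actually convert.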

As another way, I added a means to generate plain mapping
lists corresponding to the .map files (similar to the old maps but
simpler), and this finishes the work within a second.

$ make mapdumps

If we are not going to change the framework of mapped character
conversion soon, the dumper program may be useful, but I'm not sure
this is reasonable as a sanity check for future modifications. In
the PoC, pg_mb_radix_tree() is copied into map_checker.c, but this
needs to be a separate file again. (the second attachment)
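
As a rough idea of how such plain mapping lists could serve as a sanity check, the following Python sketch compares two dumps and reports added, removed and changed entries. The line format ("srchex dsthex", one pair per line) and the names parse_dump() and diff_dumps() are assumptions made for illustration; they are not the format or code of the attached patch.

```python
def parse_dump(text):
    """Parse 'srchex dsthex' lines into a dict; skip blank lines and '#' comments."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        src, dst = line.split()
        mapping[bytes.fromhex(src)] = bytes.fromhex(dst)
    return mapping


def diff_dumps(old, new):
    """Return (added, removed, changed) source codes between two mappings."""
    added = sorted(k for k in new if k not in old)
    removed = sorted(k for k in old if k not in new)
    changed = sorted(k for k in old if k in new and old[k] != new[k])
    return added, removed, changed


# Hypothetical dumps taken before and after a modification.
old = parse_dump("01 01\n02 02\nfc4a e9b899\nfc4b e9bb91\n")
new = parse_dump("01 01\n02 02\nfc4a e9b899\nfc4b e9bb92\n")
added, removed, changed = diff_dumps(old, new)
print("added:", added, "removed:", removed, "changed:", changed)
```

An empty result in all three lists would mean the two dumps describe the same mapping.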


Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
0001-encoding_dumper.patch text/x-patch 7.3 KB
0002-map_dumper.patch text/x-patch 8.7 KB
