Re: Radix tree for character conversion

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: hlinnaka(at)iki(dot)fi
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, michael(dot)paquier(at)gmail(dot)com, daniel(at)yesql(dot)se, peter(dot)eisentraut(at)2ndquadrant(dot)com, robertmhaas(at)gmail(dot)com, tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com, ishii(at)sraoss(dot)co(dot)jp, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Radix tree for character conversion
Date: 2017-03-23 03:13:07
Message-ID: 20170323.121307.241436413.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

At Tue, 21 Mar 2017 13:10:48 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote in <20170321(dot)131048(dot)150321071(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> At Fri, 17 Mar 2017 13:03:35 +0200, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote in <01efd334-b839-0450-1b63-f2dea9326a7e(at)iki(dot)fi>
> > On 03/17/2017 07:19 AM, Kyotaro HORIGUCHI wrote:
> > > I would like to use convert() function. It can be a large
> > > PL/PgSQL function or a series of "SELECT convert(...)"s. The
> > > latter is doable on-the-fly (by not generating/storing the whole
> > > script).
> > >
> > > | -- Test for SJIS->UTF-8 conversion
> > > | ...
> > > | SELECT convert('\0000', 'SJIS', 'UTF-8'); -- results in error
> > > | ...
> > > | SELECT convert('\897e', 'SJIS', 'UTF-8');
> >
> > Makes sense.
> >
> > >> You could then run those SQL statements against old and new server
> > >> version, and verify that you get the same results.
> > >
> > > Including the result files in the repository will make this easy
> > > but unacceptably bloats. Put mb/Unicode/README.sanity_check?
> >
> > Yeah, a README with instructions on how to do sounds good. No need to
> > include the results in the repository, you can run the script against
> > an older version when you need something to compare with.
>
> Ok, I'll write a small script to generate a set of "conversion
> dump" and try to write README.sanity_check describing how to use
> it.

I found that there's no way to identify the character domain of a
conversion on SQL interface. Unconditionally giving from 0 to
0xffffffff as a bytea string yields too-bloat result by containg
many bogus lines. (If \x40 is a character, convert() also
accepts \x4040, \x404040 and \x40404040)

One more annoyance is the fact that mappings and conversion
procedures are not in one-to-one correspondence. The
corresnponcence is hidden in conversion_procs/*.c files so we
should extract it from them or provide as knowledge. Both don't
seem good.

Finally, it seems that I have no choice than resurrecting
map_checker. The exactly the same one no longer works but
map_dumper.c with almost the same structure will work.

If no one objects to adding map_dumper.c and
gen_mapdumper_header.pl (tentavie name, of course), I'll make a
patch to do that.

Any suggestions?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2017-03-23 03:13:20 Re: segfault in hot standby for hash indexes
Previous Message Amit Kapila 2017-03-23 02:34:54 Re: pageinspect and hash indexes