Re: Radix tree for character conversion

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: hlinnaka(at)iki(dot)fi
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, michael(dot)paquier(at)gmail(dot)com, daniel(at)yesql(dot)se, peter(dot)eisentraut(at)2ndquadrant(dot)com, robertmhaas(at)gmail(dot)com, tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com, ishii(at)sraoss(dot)co(dot)jp, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Radix tree for character conversion
Date: 2017-03-17 05:19:31
Message-ID: 20170317.141931.188239181.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-hackers

Thank you for committing this.

At Mon, 13 Mar 2017 21:07:39 +0200, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote in <d5b70078-9f57-0f63-3462-1e564a57739f(at)iki(dot)fi>
> On 03/13/2017 08:53 PM, Tom Lane wrote:
> > Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
> >> It would be nice to run the map_checker tool one more time, though, to
> >> verify that the mappings match those from PostgreSQL 9.6.
> >
> > +1
> >
> >> Just to be sure, and after that the map checker can go to the dustbin.
> >
> > Hm, maybe we should keep it around for the next time somebody has a
> > bright
> > idea in this area?
>
> The map checker compares old-style maps with the new radix maps. The
> next time 'round, we'll need something that compares the radix maps
> with the next great thing. Not sure how easy it would be to adapt.
>
> Hmm. A somewhat different approach might be more suitable for testing
> across versions, though. We could modify the perl scripts slightly to
> print out SQL statements that exercise every mapping. For every
> supported conversion, the SQL script could:
>
> 1. create a database in the source encoding.
> 2. set client_encoding='<target encoding>'
> 3. SELECT a string that contains every character in the source
> encoding.

Many encodings can be used as a client encoding but not as a
database encoding, and some encodings, such as UTF-8, have
several one-way conversions. If we did something like this, it
would look like the following.

1. Encoding test
1-1. create a database in UTF-8
1-2. set client_encoding='<source encoding>'
1-3. INSERT all characters defined in the source encoding.
1-4. set client_encoding='UTF-8'
1-5. SELECT a string that contains every character in UTF-8.
2. Decoding test

.... sucks!

I would like to use the convert() function instead. It could be
one large PL/pgSQL function or a series of "SELECT convert(...)"
statements. The latter can be done on the fly (without
generating or storing the whole script).

| -- Test for SJIS->UTF-8 conversion
| ...
| SELECT convert('\x0000', 'SJIS', 'UTF-8'); -- results in error
| ...
| SELECT convert('\x897e', 'SJIS', 'UTF-8');
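
The series of statements above could be produced on the fly by a small generator. What follows is only a minimal sketch, not the actual perl scripts in mb/Unicode: the function names (code_to_bytea_literal, convert_statements) are hypothetical, and it assumes the list of source code points has already been read from the mapping data. It writes each code as a hex bytea literal, which is the argument type convert() takes.

```python
def code_to_bytea_literal(code):
    """Render an integer character code as a PostgreSQL hex bytea literal.

    Hypothetical helper: 0x897e becomes '\\x897e', and 0 becomes '\\x00'.
    """
    nbytes = max(1, (code.bit_length() + 7) // 8)
    return "'\\x" + code.to_bytes(nbytes, "big").hex() + "'"


def convert_statements(codes, src="SJIS", dst="UTF8"):
    """Yield one SELECT convert(...) statement per source code.

    `codes` is assumed to come from the mapping data; nothing is stored,
    so the statements can be piped straight into psql.
    """
    for code in codes:
        yield "SELECT convert({}, '{}', '{}');".format(
            code_to_bytea_literal(code), src, dst
        )


if __name__ == "__main__":
    # Two sample codes only; a real run would iterate over the whole map.
    for stmt in convert_statements([0x00, 0x897E]):
        print(stmt)
```

Because the statements are generated lazily, the output can be fed to psql directly, which by default continues past statements that raise an error.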

> You could then run those SQL statements against old and new server
> version, and verify that you get the same results.

Including the result files in the repository would make this easy
but would bloat it unacceptably. Perhaps put the procedure in
mb/Unicode/README.sanity_check instead?
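
If the result files are not stored, the outputs from the old and new servers would have to be captured at test time and compared directly. A minimal sketch of that comparison, assuming both outputs are already available as lists of lines (the function name diff_results is hypothetical, and this is not an existing tool in the tree):

```python
import difflib


def diff_results(old_lines, new_lines):
    """Return unified-diff lines between two captured outputs.

    An empty list means both servers produced identical results.
    """
    return list(
        difflib.unified_diff(
            old_lines, new_lines, fromfile="old", tofile="new", lineterm=""
        )
    )


if __name__ == "__main__":
    # Made-up sample output lines, for illustration only.
    old = ["row 1", "row 2"]
    new = ["row 1", "row 2 changed"]
    for line in diff_results(old, new):
        print(line)
```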

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
