Re: Radix tree for character conversion

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: michael(dot)paquier(at)gmail(dot)com
Cc: hlinnaka(at)iki(dot)fi, daniel(at)yesql(dot)se, peter(dot)eisentraut(at)2ndquadrant(dot)com, robertmhaas(at)gmail(dot)com, tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, ishii(at)sraoss(dot)co(dot)jp, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Radix tree for character conversion
Date: 2017-01-27 08:33:57
Message-ID: 20170127.173357.221584433.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi, this is an intermediate report without a patch.

At Thu, 26 Jan 2017 21:42:12 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote in <20170126(dot)214212(dot)111556326(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> > > 0003-Apply-pgperltidy-on-src-backend-utils-mb-Unicode.patch
> > >
> > > Before adding radix tree stuff, applied pgperltidy and inserted
> > > format-skipping pragma for the parts where perltidy seems to do
> > > too much.
> >
> > Which version of perltidy did you use? Looking at the archives, the
> > perl code is cleaned up with a specific version, v20090616. See
> > https://www.postgresql.org/message-id/20151204054322.GA2070309@tornado.leadboat.com
> > for example on the matter. As perltidy changes over time, this may be
> > a sensitive change if done this way.
>
> Hmm. I will make a confirmation on that.. tomorrow.

My perltidy -v said "v20121207'. Anyway, I gave up to apply
perltidy by myself. So I'll just drop 0003 and new 0004 (name
changed to 0003) is made immediately on 0002.

> > > 0004-Use-radix-tree-for-character-conversion.patch
> > >
> > > Radix tree body.
> >
> > Well, here a lot of diffs could have been saved.
> >
> > > The unattached fifth patch is generated by the following steps.
> > >
> > > [$(TOP)]$ ./configure
> > > [Unicode]$ make
> > > [Unicode]$ make distclean
> > > [Unicode]$ git add .
> > > [Unicode]$ commit
> > > === COMMITE MESSSAGE
> > > Replace map files with radix tree files.
> > >
> > > These encodings no longer uses the former map files and uses new radix
> > > tree files.
> > > ===
> >
> > OK, I can see that working, with 200k of maps generated.. So going
> > through the important bits of this jungle..
>
> Many thaks for the exploration.
>
> > +/*
> > + * radix tree conversion function - this should be identical to the function in
> > + * ../conv.c with the same name
..
> > This is not nice. Having a duplication like that is a recipe to forget
> > about it as this patch introduces a dependency with conv.c and the
> > radix tree generation.

In the attatched patch, mb/char_conveter.c which contains one
inline function is created and it is includ'ed from mb/conv.c and
mb/Unicode/map_checker.c.

> > Having a .gitignore in Unicode/ would be nice, particularly to avoid
> > committing map_checker.

I missed this. I added .gitignore to ignore map_checker stuff
and authority files and old-style map files.

> > A README documenting things may be welcome, or at least comments at
> > the top of map_checker.c. Why is map_checker essential? What does it
> > do? There is no way to understand that easily, except that it includes
> > a "radix tree conversion function", and that it performs sanity checks
> > on the radix trees to be sure that they are on a good shape. But as
> > this something that one would guess only after looking at your patch
> > and the code (at least I will sleep less stupid tonight after reading
> > this stuff).
>
> Okay, I'll do that.

The patch has not been provided yet, I'm going to put the
following comment just before the main() in map_checker.c.

/*
* The old-style plain map files were error-resistant due to its
* straight-forward way for generation from authority files. In contrast the
* radix tree maps are generated by a rather complex calculation and have a
* complex, hard-to-confirm format.
*
* This program runs sanity check of the radix tree maps by confirming all
* characters in the plain map files to be converted to the same code by the
* corresponding radix tree map.
*
* All map files are included by map_checker.h that is generated by the script
* make_mapchecker.pl as the variable mappairs.
*
*/

I'll do the following things later.

> > --- a/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl
> > +++ b/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl
> > # Drop these SJIS codes from the source for UTF8=>SJIS conversion
> > #<<< do not let perltidy touch this
> > -my @reject_sjis =(
> > +my @reject_sjis = (
> > 0xed40..0xeefc, 0x8754..0x875d, 0x878a, 0x8782,
> > - 0x8784, 0xfa5b, 0xfa54, 0x8790..0x8792, 0x8795..0x8797,
> > + 0x8784, 0xfa5b, 0xfa54, 0x8790..0x8792, 0x8795..0x8797,
> > 0x879a..0x879c
> > -);
> > + );
> > This is not generated, it would be nice to drop the noise from the patch.
>
> Mmm. I'm not sure how this is generated but I'll care for that.
>
> > Here is another one:
> > - $i->{code} = $jis | (
> > - $jis < 0x100
> > - ? 0x8e00
> > - : ($sjis >= 0xeffd ? 0x8f8080 : 0x8080));
> > -
> > +#<<< do not let perltidy touch this
> > + $i->{code} = $jis | ($jis < 0x100 ? 0x8e00:
> > + ($sjis >= 0xeffd ? 0x8f8080 : 0x8080));
> > +#>>>
>
> Ok. Will revert this.
>
> > if (l == 2)
> > {
> > - iutf = *utf++ << 8;
> > - iutf |= *utf++;
> > + b3 = *utf++;
> > + b4 = *utf++;
> > }
> > Ah, OK. This conversion is important so as it performs a minimum of
> > bitwise operations. Yes let's keep that. That's pretty cool to get a
> > faster operation.
>
> It is Heikki's work:p
>
> I'll address them and repost the next version sooner.

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2017-01-27 08:35:37 Re: Failure in commit_ts tap tests
Previous Message Venkata B Nagothi 2017-01-27 08:19:04 Re: patch proposal