Re: Radix tree for character conversion

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: michael(dot)paquier(at)gmail(dot)com
Cc: hlinnaka(at)iki(dot)fi, daniel(at)yesql(dot)se, peter(dot)eisentraut(at)2ndquadrant(dot)com, robertmhaas(at)gmail(dot)com, tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, ishii(at)sraoss(dot)co(dot)jp, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Radix tree for character conversion
Date: 2017-02-28 08:34:02
Message-ID: 20170228.173402.41146343.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

At Tue, 28 Feb 2017 15:20:06 +0900, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote in <CAB7nPqR49krGP6qaaKaL2v3HCnn+dnzv8Dq_ySGbDSr6b_ywrw(at)mail(dot)gmail(dot)com>
> On Mon, Feb 27, 2017 at 5:37 PM, Kyotaro HORIGUCHI
> <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> > At Wed, 22 Feb 2017 16:06:14 +0900, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote in <CAB7nPqRTQ+7ZjxuPTbsr18MXvW7mTd29mN+91N7AG8fe5aCeAA(at)mail(dot)gmail(dot)com>
> >> In order to conduct sanity checks on the shape of the radix tree maps
> >> compared to the existing maps, having map_checker surely makes sense.
> >> Now in the final result I don't think we need it. The existing map
> >> files ought to be replaced by their radix versions at the end, and
> >> map_checker should be removed. This leads to a couple of
> >> simplifications, like Makefile, and reduces the maintenance to one
> >> mechanism.
> >
> > Hmm.. Though I don't remember clearly what the radix map of the
> > first version looked like, the current radix map seems
> > human-readable for me. It might be by practice or by additional
> > comments in map files. Anyway I removed all of the stuff so as
> > not to generate the plain maps. But I didn't change the names of
> > _radix.map and just commented out the line to output the plain
> > maps in UCS_to_*.pl. Combined maps are still in the plain format
> > so print_tables was changed to take character tables separately
> > for regular (non-combined) characters and combined characters.
>
> Do others have thoughts to offer on the matter? I would think that the
> new radix maps should just replace by the old plain ones, and that the
> only way to build the maps going forward is to use the new methods.
> The radix trees is the only thing used in the backend code as well
> (conv.c). We could keep the way to build the old maps, with the
> map_checker in module out of the core code. FWIW, I am fine to add the
> old APIs in my plugin repository on github and have the sanity checks
> in that as well. And of course also publish on this thread a module to
> do that.

I couldn't make out my mind to move to radix tree completely, but
UtfToLocal/LocalToUtf no longer handle the "plain map"s for
non-combined character so they have lost their planground. Okay,
I think I removed all the trace of the plain map era.

Every characters in a mapping has a comment that describes what
the character is or where it is defined. This information is no
longer useful (radix map doesn't have a plance to show it) but
I left it for debug use. (This might just be justification..)

> > - Split the property {direction} into two boolean properties
> > {to_unicode} and {from_unicode}.
> >
> > - Make the {direction} property an integer and compared with
> > defined constants $BOTH, $TO_UNICODE and $FROM_UNICODE using
> > the '=' operator.
> >
> > I choosed the former in this patch.
>
> Fine for me.

Thanks.

> >> + $charmap{ ucs2utf($src) } = $dst;
> >> + }
> >> +
> >> + }
> >> Unnecessary newline here.
> >
> > removed in convutils.pm.
> >
> > Since Makefile ignores old .map files, the steps to generate a
> > patch for map files was a bit chaged.
> >
> > $ rm *.map
> > $ make distclean maintainer-clean all
> > $ make distclean
> > $ git add .
> > $ git commit
>
> +# ignore generated files
> +/map_checker
> +/map_checker.h
> [...]
> +map_checker.h: make_mapchecker.pl $(MAPS) $(RADIXMAPS)
> + $(PERL) $<
> +
> +map_checker.o: map_checker.c map_checker.h ../char_converter.c
> +
> +map_checker: map_checker.o
> With map_checker out of the game, those things are not needed.

Ouch! Thanks for pointing out it. Removed.

> +++ b/src/backend/utils/mb/char_converter.c
> @@ -0,0 +1,116 @@
> +/*-------------------------------------------------------------------------
> + *
> + * Character converter function using radix tree
> In the simplified version of the patch, pg_mb_radix_conv() being only
> needed in conv.c I think that this could just be a static local
> routine.
>
> -#include "../../Unicode/utf8_to_koi8r.map"
> -#include "../../Unicode/koi8r_to_utf8.map"
> -#include "../../Unicode/utf8_to_koi8u.map"
> -#include "../../Unicode/koi8u_to_utf8.map"
> +#include "../../Unicode/utf8_to_koi8r_radix.map"
> +#include "../../Unicode/koi8r_to_utf8_radix.map"
> +#include "../../Unicode/utf8_to_koi8u_radix.map"
> +#include "../../Unicode/koi8u_to_utf8_radix.map"
> FWIW, I am fine to use those new names as include points.
>
> -distclean: clean
> +distclean:
> rm -f $(TEXTS)
> -maintainer-clean: distclean
> +# maintainer-clean intentionally leaves $(TEXTS)
> +maintainer-clean:
> Why is that? There is also a useless diff down that code block.

It *was* for convenience but now it is automatically downloaded
so such distinction donsn't offer anything good. Changed it to
remove $(TEXTS).

> +conv.o: conv.c char_converter.c
> This also can go away.

Touching char_converter.c will be ignored if it is removed. Did
you mistake it for map_checker?

> -print_tables("EUC_JIS_2004", \(at)all, 1);
> +# print_tables("EUC_JIS_2004", \(at)regular, undef, 1);
> +print_radix_trees("EUC_JIS_2004", \(at)regular);
> +print_tables("EUC_JIS_2004", undef, \(at)combined, 1);
> [...]
> sub print_tables
> {
> - my ($charset, $table, $verbose) = @_;
> + my ($charset, $regular, $combined, $verbose) = @_;
> print_tables is only used for combined maps, you could remove $regular
> from it and just keep $combined around, perhaps renaming print_tables
> to print_combined_maps?

Renamed to print_combied_maps.

And the code-comment pointed in the comment by the previous mail
is rewritten as Robert's suggestion.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
0001-Use-radix-tree-for-character-conversion.patch.gz application/octet-stream 16.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dagfinn Ilmari =?utf-8?Q?Manns=C3=A5ker?= 2017-02-28 09:12:51 Re: timeouts in PostgresNode::psql
Previous Message Simon Riggs 2017-02-28 08:25:54 Re: Documentation improvements for partitioning