Re: Radix tree for character conversion

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, daniel(at)yesql(dot)se, peter(dot)eisentraut(at)2ndquadrant(dot)com, Robert Haas <robertmhaas(at)gmail(dot)com>, tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Radix tree for character conversion
Date: 2017-02-22 07:06:14
Message-ID: CAB7nPqRTQ+7ZjxuPTbsr18MXvW7mTd29mN+91N7AG8fe5aCeAA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Feb 3, 2017 at 1:18 PM, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> Thanks to that Heikki have pushed the first two patches and a
> part of the third, only one patch is remaining now.
>
> # Sorry for not separating KOI8 stuffs.
>
> At Tue, 31 Jan 2017 19:06:09 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote in <20170131(dot)190609(dot)254672218(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
>> > Thanks for the new version, I'll look at it once I am done with the
>> > cleanup of the current CF. For now I have moved it to the CF 2017-03.
>>
>> Agreed. Thank you.
>
> Attached is the latest version on the current master (555494d).
>
> Note: since this patch is created by git diff --irreversble-delete,
> three files mb/Unicode/*.(txt|xml) to be deleted are left alone.

Thanks for the rebase. I have been spending sore time looking at this
patch. The new stuff in convutils.pm is by far the interesting part of
the patch, where the building of the radix trees using a byte
structure looks in pretty good shape after eyeballing the logic for a
couple of hours.

+# ignore backup files of editors
+/*[~#]
+
This does not belong to Postgres core code. You could always set up
that in a global exclude file with core.excludefiles.

In order to conduct sanity checks on the shape of the radix tree maps
compared to the existing maps, having map_checker surely makes sense.
Now in the final result I don't think we need it. The existing map
files ought to be replaced by their radix versions at the end, and
map_checker should be removed. This leads to a couple of
simplifications, like Makefile, and reduces the maintenance to one
mechanism.

+sub print_radix_trees
+{
+ my ($this_script, $csname, $charset) = @_;
+
+ &print_radix_map($this_script, $csname, "from_unicode", $charset, 78);
+ &print_radix_map($this_script, $csname, "to_unicode", $charset, 78);
+}
There is no need for the table width to be defined as a variable (5th
argument). Similarly, to_unicode/from_unicode require checks in
multiple places, this could be just a simple boolean flag. Or if you
want to go to the road of non-simple things, you could have two
arguments: an origin and a target. If one is UTF8 the other is the
mapping name.

+sub dump_charset
+{
+ my ($list, $filt) = @_;
+
+ foreach my $i (@$list)
+ {
+ next if (defined $filt && !&$filt($i));
+ if (!defined $i->{ucs}) { $i->{ucs} = &utf2ucs($i->{utf8}); }
+ printf "ucs=%x, code=%x, direction=%s %s:%d %s\n",
+ $i->{ucs}, $i->{code}, $i->{direction},
+ $i->{f}, $i->{l}, $i->{comment};
+ }
+}
This is used nowhere. Perhaps it was useful for debugging at some point?

+# make_charmap - convert charset table to charmap hash
+# with checking duplicate source code
Maybe this should be "with checking of duplicated source codes".

+# print_radix_map($this_script, $csname, $direction, \%charset, $tblwidth)
+#
+# this_script - the name of the *caller script* of this feature
$this_script is not needed at all, you could just use basename($0) and
reduce the number of arguments of the different functions of the
stack.

+ ### amount of zeros that can be ovarlaid.
s/ovarlaid/overlaid.

+# make_mapchecker.pl - Gerates map_checker.h file included by map_checker.c
s/gerates/generates/

+ if (s < 0x80)
+ {
+ fprintf(stderr, "\nASCII character ? (%x)", s);
+ exit(1);
+ }
Most likely a newline at the end of the error string is better here.

+ $charmap{ ucs2utf($src) } = $dst;
+ }
+
+ }
Unnecessary newline here.
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Joel Jacobson 2017-02-22 07:06:35 GRANT EXECUTE ON FUNCTION foo() TO bar();
Previous Message Robert Haas 2017-02-22 06:55:16 Re: Passing query string to workers