Re: Radix tree for character conversion

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, daniel(at)yesql(dot)se, peter(dot)eisentraut(at)2ndquadrant(dot)com, Robert Haas <robertmhaas(at)gmail(dot)com>, tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Radix tree for character conversion
Date: 2017-01-26 07:28:16
Message-ID: CAB7nPqREL1fsDBGv4zRvaXY+UKtS0wzkamJcnYhX0--OZvpUUQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jan 10, 2017 at 8:22 PM, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> [...patch...]

Nobody has showed up yet to review this patch, so I am giving it a shot.

The patch file sizes are scary at first sight, but after having a look:
36 files changed, 1411 insertions(+), 54398 deletions(-)
Yes that's a surprise, something like git diff --irreversible-delete
would have helped as most of the diffs are just caused by 3 files
being deleted in patch 0004, making 50k lines going to the abyss of
deletion.

> Hello, I found a bug in my portion while rebasing.

Right, that's 0001. Nice catch.

> The attached files are the following. This patchset is not
> complete missing changes of map files. The change is tremendously
> large but generatable.
>
> 0001-Add-missing-semicolon.patch
>
> UCS_to_EUC_JP.pl has a line missing teminating semicolon. This
> doesn't harm but surely a syntax error. This patch fixes it.
> This might should be a separate patch.

This requires a back-patch. This makes me wonder how long this script
has actually not run...

> 0002-Correct-reference-resolution-syntax.patch
>
> convutils.pm has lines with different syntax of reference
> resolution. This unifies the syntax.

Yes that looks right to me. I am the best perl guru on this list but
looking around $$var{foo} is bad, ${$var}{foo} is better, and
$var->{foo} is even better. This also generates no diffs when running
make in src/backend/utils/mb/Unicode/. So no objections to that.

> 0003-Apply-pgperltidy-on-src-backend-utils-mb-Unicode.patch
>
> Before adding radix tree stuff, applied pgperltidy and inserted
> format-skipping pragma for the parts where perltidy seems to do
> too much.

Which version of perltidy did you use? Looking at the archives, the
perl code is cleaned up with a specific version, v20090616. See
https://www.postgresql.org/message-id/20151204054322.GA2070309@tornado.leadboat.com
for example on the matter. As perltidy changes over time, this may be
a sensitive change if done this way.

> 0004-Use-radix-tree-for-character-conversion.patch
>
> Radix tree body.

Well, here a lot of diffs could have been saved.

> The unattached fifth patch is generated by the following steps.
>
> [$(TOP)]$ ./configure
> [Unicode]$ make
> [Unicode]$ make distclean
> [Unicode]$ git add .
> [Unicode]$ commit
> === COMMITE MESSSAGE
> Replace map files with radix tree files.
>
> These encodings no longer uses the former map files and uses new radix
> tree files.
> ===

OK, I can see that working, with 200k of maps generated.. So going
through the important bits of this jungle..

+/*
+ * radix tree conversion function - this should be identical to the function in
+ * ../conv.c with the same name
+ */
+static inline uint32
+pg_mb_radix_conv(const pg_mb_radix_tree *rt,
+ int l,
+ unsigned char b1,
+ unsigned char b2,
+ unsigned char b3,
+ unsigned char b4)
This is not nice. Having a duplication like that is a recipe to forget
about it as this patch introduces a dependency with conv.c and the
radix tree generation.

Having a .gitignore in Unicode/ would be nice, particularly to avoid
committing map_checker.

A README documenting things may be welcome, or at least comments at
the top of map_checker.c. Why is map_checker essential? What does it
do? There is no way to understand that easily, except that it includes
a "radix tree conversion function", and that it performs sanity checks
on the radix trees to be sure that they are on a good shape. But as
this something that one would guess only after looking at your patch
and the code (at least I will sleep less stupid tonight after reading
this stuff).

--- a/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl
+++ b/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl
# Drop these SJIS codes from the source for UTF8=>SJIS conversion
#<<< do not let perltidy touch this
-my @reject_sjis =(
+my @reject_sjis = (
0xed40..0xeefc, 0x8754..0x875d, 0x878a, 0x8782,
- 0x8784, 0xfa5b, 0xfa54, 0x8790..0x8792, 0x8795..0x8797,
+ 0x8784, 0xfa5b, 0xfa54, 0x8790..0x8792, 0x8795..0x8797,
0x879a..0x879c
-);
+ );
This is not generated, it would be nice to drop the noise from the patch.

Here is another one:
- $i->{code} = $jis | (
- $jis < 0x100
- ? 0x8e00
- : ($sjis >= 0xeffd ? 0x8f8080 : 0x8080));
-
+#<<< do not let perltidy touch this
+ $i->{code} = $jis | ($jis < 0x100 ? 0x8e00:
+ ($sjis >= 0xeffd ? 0x8f8080 : 0x8080));
+#>>>

if (l == 2)
{
- iutf = *utf++ << 8;
- iutf |= *utf++;
+ b3 = *utf++;
+ b4 = *utf++;
}
Ah, OK. This conversion is important so as it performs a minimum of
bitwise operations. Yes let's keep that. That's pretty cool to get a
faster operation.
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2017-01-26 07:29:10 Re: Radix tree for character conversion
Previous Message Kyotaro HORIGUCHI 2017-01-26 07:16:56 Re: Radix tree for character conversion