Re: Radix tree for character conversion

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: michael(dot)paquier(at)gmail(dot)com
Cc: hlinnaka(at)iki(dot)fi, daniel(at)yesql(dot)se, peter(dot)eisentraut(at)2ndquadrant(dot)com, robertmhaas(at)gmail(dot)com, tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, ishii(at)sraoss(dot)co(dot)jp, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Radix tree for character conversion
Date: 2017-01-26 12:42:12
Message-ID: 20170126.214212.111556326.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thank you for looking this.

At Thu, 26 Jan 2017 16:28:16 +0900, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote in <CAB7nPqREL1fsDBGv4zRvaXY+UKtS0wzkamJcnYhX0--OZvpUUQ(at)mail(dot)gmail(dot)com>
> On Tue, Jan 10, 2017 at 8:22 PM, Kyotaro HORIGUCHI
> <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> > [...patch...]
>
> Nobody has showed up yet to review this patch, so I am giving it a shot.
>
> The patch file sizes are scary at first sight, but after having a look:
> 36 files changed, 1411 insertions(+), 54398 deletions(-)
> Yes that's a surprise, something like git diff --irreversible-delete
> would have helped as most of the diffs are just caused by 3 files
> being deleted in patch 0004, making 50k lines going to the abyss of
> deletion.

Thank you. Good to hear that. I'll try that at the next chance.

> > Hello, I found a bug in my portion while rebasing.
>
> Right, that's 0001. Nice catch.
>
> > The attached files are the following. This patchset is not
> > complete missing changes of map files. The change is tremendously
> > large but generatable.
> >
> > 0001-Add-missing-semicolon.patch
> >
> > UCS_to_EUC_JP.pl has a line missing teminating semicolon. This
> > doesn't harm but surely a syntax error. This patch fixes it.
> > This might should be a separate patch.
>
> This requires a back-patch. This makes me wonder how long this script
> has actually not run...
>
> > 0002-Correct-reference-resolution-syntax.patch
> >
> > convutils.pm has lines with different syntax of reference
> > resolution. This unifies the syntax.
>
> Yes that looks right to me.

Yes, I thoght that the three patches can be back-patched, a kind
of bug fix.

> I am the best perl guru on this list but
> looking around $$var{foo} is bad, ${$var}{foo} is better, and
> $var->{foo} is even better. This also generates no diffs when running
> make in src/backend/utils/mb/Unicode/. So no objections to that.

Thank you for the explanation. I think no '$$'s is left alone.

> > 0003-Apply-pgperltidy-on-src-backend-utils-mb-Unicode.patch
> >
> > Before adding radix tree stuff, applied pgperltidy and inserted
> > format-skipping pragma for the parts where perltidy seems to do
> > too much.
>
> Which version of perltidy did you use? Looking at the archives, the
> perl code is cleaned up with a specific version, v20090616. See
> https://www.postgresql.org/message-id/20151204054322.GA2070309@tornado.leadboat.com
> for example on the matter. As perltidy changes over time, this may be
> a sensitive change if done this way.

Hmm. I will make a confirmation on that.. tomorrow.

> > 0004-Use-radix-tree-for-character-conversion.patch
> >
> > Radix tree body.
>
> Well, here a lot of diffs could have been saved.
>
> > The unattached fifth patch is generated by the following steps.
> >
> > [$(TOP)]$ ./configure
> > [Unicode]$ make
> > [Unicode]$ make distclean
> > [Unicode]$ git add .
> > [Unicode]$ commit
> > === COMMITE MESSSAGE
> > Replace map files with radix tree files.
> >
> > These encodings no longer uses the former map files and uses new radix
> > tree files.
> > ===
>
> OK, I can see that working, with 200k of maps generated.. So going
> through the important bits of this jungle..

Many thaks for the exploration.

> +/*
> + * radix tree conversion function - this should be identical to the function in
> + * ../conv.c with the same name
> + */
> +static inline uint32
> +pg_mb_radix_conv(const pg_mb_radix_tree *rt,
> + int l,
> + unsigned char b1,
> + unsigned char b2,
> + unsigned char b3,
> + unsigned char b4)
> This is not nice. Having a duplication like that is a recipe to forget
> about it as this patch introduces a dependency with conv.c and the
> radix tree generation.

Mmmmm. I agree to you, but conv.c contains unwanted reference to
elog or other sutff of the core. Separating the function in a
dedicate source file named such as "../pg_mb_radix_conv.c" will
work. If it is not so bad, I'll do that in the next version.

> Having a .gitignore in Unicode/ would be nice, particularly to avoid
> committing map_checker.
>
> A README documenting things may be welcome, or at least comments at
> the top of map_checker.c. Why is map_checker essential? What does it
> do? There is no way to understand that easily, except that it includes
> a "radix tree conversion function", and that it performs sanity checks
> on the radix trees to be sure that they are on a good shape. But as
> this something that one would guess only after looking at your patch
> and the code (at least I will sleep less stupid tonight after reading
> this stuff).

Okay, I'll do that.

> --- a/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl
> +++ b/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl
> # Drop these SJIS codes from the source for UTF8=>SJIS conversion
> #<<< do not let perltidy touch this
> -my @reject_sjis =(
> +my @reject_sjis = (
> 0xed40..0xeefc, 0x8754..0x875d, 0x878a, 0x8782,
> - 0x8784, 0xfa5b, 0xfa54, 0x8790..0x8792, 0x8795..0x8797,
> + 0x8784, 0xfa5b, 0xfa54, 0x8790..0x8792, 0x8795..0x8797,
> 0x879a..0x879c
> -);
> + );
> This is not generated, it would be nice to drop the noise from the patch.

Mmm. I'm not sure how this is generated but I'll care for that.

> Here is another one:
> - $i->{code} = $jis | (
> - $jis < 0x100
> - ? 0x8e00
> - : ($sjis >= 0xeffd ? 0x8f8080 : 0x8080));
> -
> +#<<< do not let perltidy touch this
> + $i->{code} = $jis | ($jis < 0x100 ? 0x8e00:
> + ($sjis >= 0xeffd ? 0x8f8080 : 0x8080));
> +#>>>

Ok. Will revert this.

> if (l == 2)
> {
> - iutf = *utf++ << 8;
> - iutf |= *utf++;
> + b3 = *utf++;
> + b4 = *utf++;
> }
> Ah, OK. This conversion is important so as it performs a minimum of
> bitwise operations. Yes let's keep that. That's pretty cool to get a
> faster operation.

It is Heikki's work:p

I'll address them and repost the next version sooner.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Haribabu Kommi 2017-01-26 12:45:22 Re: pg_hba_file_settings view patch
Previous Message Simon Riggs 2017-01-26 12:20:06 Re: Replication slot xmin is not reset if HS feedback is turned off while standby is shut down