Re: Radix tree for character conversion

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: hlinnaka(at)iki(dot)fi
Cc: daniel(at)yesql(dot)se, peter(dot)eisentraut(at)2ndquadrant(dot)com, robertmhaas(at)gmail(dot)com, tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, ishii(at)sraoss(dot)co(dot)jp, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Radix tree for character conversion
Date: 2016-12-05 10:29:54
Message-ID: 20161205.192954.121855559.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello, thank you for reviewing this.

I compared mine and yours. The new patch works fine and gives
smaller radix map files. It seems also to me more readable.

At Fri, 2 Dec 2016 22:07:07 +0200, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote in <da58a154-0b28-802e-5e82-5a205f53926e(at)iki(dot)fi>
> On 11/09/2016 10:38 AM, Kyotaro HORIGUCHI wrote:
> > Thanks. The attached patch contains the patch by perlcritic.
> >
> > 0001,2,3 are Heikki's patch that are not modified since it is
> > first proposed. It's a bit too big so I don't attach them to this
> > mail (again).
> >
> > https://www.postgresql.org/message-id/08e7892a-d55c-eefe-76e6-7910bc8dd1f3@iki.fi
>
> I've now pushed these preliminary patches, with the applicable fixes
> from you and Daniel. The attached patch is now against git master.

Thanks for committing them.

> > 0004 is radix-tree stuff, applies on top of the three patches
> > above.
>
> I've spent the last couple of days reviewing this. While trying to
> understand how it works, I ended up dismantling, rewriting, and
> putting back together most of the added perl code.

I might have been putting too much in one structure and a bit too
eager to conceal lower level from the upper level.

> Attached is a new
> version, with more straightforward logic, making it more
> understandable. I find it more understandable, anyway, I hope it's not
> only because I wrote it myself :-). Let me know what you think.

First, thank you for the refactoring(?).

I didn't intend to replace all of .map files with radix files at
first. Finally my patch removes all old-style map file but I
haven't noticed that. So removing the old bsearch code seems
reasonable. Avoiding redundant decomposition of multibyte
characters into bytes seems reasonable from the view of
efficiency.

The new patch decomposes the structured pg_mb_radix_tree into a
series of (basically) plain member variables in a struct. I'm not
so in favor of the style but a radix tree of at least 4-levels is
easily read in the style and maybe the code to handle it is
rather easily readable. (So +1 for it)

> In particular, I found the explanations of flat and segmented tables
> really hard to understand. So in this version, the radix trees for a
> conversion are stored completely in one large array. Leaf and
> intermediate levels are all in the same array. When reading this
> version, please note that I'm not sure if I mean the same thing with
> "segment" that you did in your version.
> I moved the "lower" and "upper" values in the structs. Also, there are
> now also separate "lower" and "upper" values for the leaf levels of
> the trees, for 1- 2-, 3- and 4-byte inputs. This made a huge

The "segment" there seems to mean definitely the same to
mine. Flattening the on-memory structure is fine from the same
reason to the above.

> difference to the size of gb18030_to_utf8_radix.map, in particular:
> the source file shrank from about 2 MB to 1/2 MB. In that conversion,
> the valid range for the last byte of 2-byte inputs is 0x40-0xfe, and
> the valid range for the last byte of 4-byte inputs is 0x30-0x39. With
> the old patch version, the "chars" range was therefore 0x30-0xfe, to
> cover both of those, and most of the array was filled with zeros. With
> this new patch version, we store separate ranges for those, and can
> leave out most of the zeros.

Great. I agree that the (logically) devided chartable is
significantly space-efficient.

> There's a segment full of zeros at the beginning of each conversion
> array now. The purpose of that is that when traversing the radix tree,
> you don't need to check each intermediate value for 0. If you follow a
> 0 offset, it simply points to the dummy all-zeros segments in the
> beginning. Seemed like a good idea to shave some cycles, although I'm
> not sure if it made much difference in reality.

And I like the zero page.

> I optimized pg_mb_radix_conv() a bit, too. We could do more. For
> example, I think it would save some cycles to have specialized
> versions of UtfToLocal and LocalToUtf, moving the tests for whether a
> combined character map and/or conversion callback is used, out of the
> loop. They feel a bit ugly too, in their current form...

Hmm. Maybe decomposing iiso in pg_mb_radix_conv is faster than
pushing extra 3 (or 4) parameters into the stack (or it's wrong
if they are passed using registers?) but I'm not sure.

> I need a break now, but I'll try to pick this up again some time next
> week. Meanwhile, please have a look and tell me what you think.

Thank you very much for the big effort on this patch.

Apart from the aboves, I have some trivial comments on the new
version.

1. If we decide not to use old-style maps, UtfToLocal no longer
need to take void * as map data. (Patch 0001)

2. "use Data::Dumper" doesn't seem necessary. (Patch 0002)

3. A comment contains a superfluous comma. (Patch 0002) (The last
byte of the first line below)
> ### The segments are written out physically to one big array in the final,
> ### step, but logically, they form a radix tree. Or rather, four radix

4. The following code doesn't seem so perl'ish.

> for (my $i=0; $i <= 0xff; $i++)
> {
> my $val = $seg->{values}->{$i};
> if ($val)
> {
> $this_min = $i if (!defined $this_min || $i < $this_min);

Refraining from proposing extreme perl, the following would be
reasonable as an equivalent. (Patch 0002)

foreach $i (keys $seg->{values})
{

The 0002 patch contains the following change but this might be
kinda extreme..

- for (my $i=0; $i <= 0xff; $i++)
+ while ((my $i, my $val) = each $map)

4. download_srctxts.sh is no longer needed. (No patch)

I'll put more consideration on the new version and put another
version later.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
0001-Change-type-for-map-files-as-the-parameter-of-UtfToL.patch text/x-patch 2.8 KB
0002-Several-trivial-fixes-on-convutils.pm.patch text/x-patch 3.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro HORIGUCHI 2016-12-05 10:36:56 Re: A bug of psql completion
Previous Message Albe Laurenz 2016-12-05 10:18:55 Re: Parallel execution and prepared statements