Re: Radix tree for character conversion

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, daniel(at)yesql(dot)se
Cc: peter(dot)eisentraut(at)2ndquadrant(dot)com, robertmhaas(at)gmail(dot)com, tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, ishii(at)sraoss(dot)co(dot)jp, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Radix tree for character conversion
Date: 2016-12-02 20:07:07
Message-ID: da58a154-0b28-802e-5e82-5a205f53926e@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 11/09/2016 10:38 AM, Kyotaro HORIGUCHI wrote:
> Thanks. The attached patch contains the patch by perlcritic.
>
> 0001,2,3 are Heikki's patch that are not modified since it is
> first proposed. It's a bit too big so I don't attach them to this
> mail (again).
>
> https://www.postgresql.org/message-id/08e7892a-d55c-eefe-76e6-7910bc8dd1f3@iki.fi

I've now pushed these preliminary patches, with the applicable fixes
from you and Daniel. The attached patch is now against git master.

> 0004 is radix-tree stuff, applies on top of the three patches
> above.

I've spent the last couple of days reviewing this. While trying to
understand how it works, I ended up dismantling, rewriting, and putting
back together most of the added perl code. Attached is a new version,
with more straightforward logic, making it more understandable. I find
it more understandable, anyway, I hope it's not only because I wrote it
myself :-). Let me know what you think.

In particular, I found the explanations of flat and segmented tables
really hard to understand. So in this version, the radix trees for a
conversion are stored completely in one large array. Leaf and
intermediate levels are all in the same array. When reading this
version, please note that I'm not sure if I mean the same thing with
"segment" that you did in your version.

I moved the "lower" and "upper" values in the structs. Also, there are
now also separate "lower" and "upper" values for the leaf levels of the
trees, for 1- 2-, 3- and 4-byte inputs. This made a huge difference to
the size of gb18030_to_utf8_radix.map, in particular: the source file
shrank from about 2 MB to 1/2 MB. In that conversion, the valid range
for the last byte of 2-byte inputs is 0x40-0xfe, and the valid range for
the last byte of 4-byte inputs is 0x30-0x39. With the old patch version,
the "chars" range was therefore 0x30-0xfe, to cover both of those, and
most of the array was filled with zeros. With this new patch version, we
store separate ranges for those, and can leave out most of the zeros.

There's a segment full of zeros at the beginning of each conversion
array now. The purpose of that is that when traversing the radix tree,
you don't need to check each intermediate value for 0. If you follow a 0
offset, it simply points to the dummy all-zeros segments in the
beginning. Seemed like a good idea to shave some cycles, although I'm
not sure if it made much difference in reality.

I optimized pg_mb_radix_conv() a bit, too. We could do more. For
example, I think it would save some cycles to have specialized versions
of UtfToLocal and LocalToUtf, moving the tests for whether a combined
character map and/or conversion callback is used, out of the loop. They
feel a bit ugly too, in their current form...

I need a break now, but I'll try to pick this up again some time next
week. Meanwhile, please have a look and tell me what you think.

- Heikki

Attachment Content-Type Size
maps-as-radix-trees-1.patch text/x-patch 99.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2016-12-02 20:18:46 Re: Radix tree for character conversion
Previous Message Tom Lane 2016-12-02 20:05:00 Re: Wrong order of tests in findDependentObjects()