From: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> |
---|---|
To: | tgl(at)sss(dot)pgh(dot)pa(dot)us |
Cc: | john(dot)naylor(at)enterprisedb(dot)com, michael(at)paquier(dot)xyz, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: speed up unicode decomposition and recomposition |
Date: | 2020-10-15 05:30:47 |
Message-ID: | 20201015.143047.941890614281076696.horikyota.ntt@gmail.com |
Lists: | pgsql-hackers |
At Wed, 14 Oct 2020 23:06:28 -0400, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote in
> John Naylor <john(dot)naylor(at)enterprisedb(dot)com> writes:
> > With those points in mind and thinking more broadly, I'd like to try harder
> > on recomposition. Even several times faster, recomposition is still orders
> > of magnitude slower than ICU, as measured by Daniel Verite [1].
>
> Huh. Has anyone looked into how they do it?
I'm not sure this is the reason, but it probably is: it seems to use
separate tables for decomposition and composition, both reached through
a trie. Those tables are consulted only after trying algorithmic
decomposition/composition for, for example, Hangul. I didn't look any
further, but just for information,
icu4c/source/common/normalizer2impl.cpp seems to be doing that, and
icu4c/source/common/norm2_nfc_data.h defines the static data.
icu4c/source/common/normalizer2impl.h:244 points to a design document
on normalization.
http://site.icu-project.org/design/normalization/custom
> Old and New Implementation Details
>
> The old normalization data format (unorm.icu, ca. 2001..2009) uses
> three data structures for normalization: A trie for looking up 32-bit
> values for every code point, a 16-bit-unit array with decompositions
> and some other data, and a composition table (16-bit-unit array,
> linear search list per starter). The data is combined for all 4
> standard normalization forms: NFC, NFD, NFKC and NFKD.
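To make the two steps above concrete (algorithmic Hangul first, then a
per-starter list searched linearly), here is a minimal Python sketch. It
is not ICU's code: the COMPOSITIONS table and compose_pair() are
hypothetical names for illustration, the table holds only a few sample
pairs, and a plain dict stands in for the trie. The Hangul constants are
the ones from the Unicode Standard.

```python
# Hangul syllable constants as defined by the Unicode Standard.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
S_COUNT = L_COUNT * V_COUNT * T_COUNT  # 11172 precomposed syllables

# Hypothetical, tiny composition table (an assumption for illustration):
# starter -> list of (combining code point, composite code point).
# A dict stands in for the trie lookup that finds the starter's list.
COMPOSITIONS = {
    0x0041: [(0x0300, 0x00C0), (0x0301, 0x00C1), (0x0308, 0x00C4)],  # A
    0x0065: [(0x0301, 0x00E9)],                                      # e
}


def compose_pair(first, second):
    """Return the composite of (first, second), or None if none exists."""
    # Algorithmic step 1: leading consonant + vowel -> LV syllable.
    if (L_BASE <= first < L_BASE + L_COUNT
            and V_BASE <= second < V_BASE + V_COUNT):
        l_index = first - L_BASE
        v_index = second - V_BASE
        return S_BASE + (l_index * V_COUNT + v_index) * T_COUNT

    # Algorithmic step 2: LV syllable + trailing consonant -> LVT syllable.
    if (S_BASE <= first < S_BASE + S_COUNT
            and (first - S_BASE) % T_COUNT == 0
            and T_BASE < second < T_BASE + T_COUNT):
        return first + (second - T_BASE)

    # Table step: linear search through the starter's composition list.
    for combining, composite in COMPOSITIONS.get(first, ()):
        if combining == second:
            return composite
    return None
```

For instance, compose_pair(0x1100, 0x1161) takes the algorithmic path,
while compose_pair(0x0041, 0x0301) falls through to the table. The
per-starter lists are short, which is presumably why a linear search was
considered acceptable in the old format.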
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center