Re: speed up unicode decomposition and recomposition

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, daniel(at)manitou-mail(dot)org
Subject: Re: speed up unicode decomposition and recomposition
Date: 2020-10-15 17:59:38
Message-ID: CAFBsxsFFCbooybVsWDj_wzDLEKN04YqCRssQ5Li0bUyWvR9eVA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Oct 15, 2020 at 1:30 AM Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
wrote:

> At Wed, 14 Oct 2020 23:06:28 -0400, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote in
> > John Naylor <john(dot)naylor(at)enterprisedb(dot)com> writes:
> > > With those points in mind and thinking more broadly, I'd like to try harder
> > > on recomposition. Even several times faster, recomposition is still orders
> > > of magnitude slower than ICU, as measured by Daniel Verite [1].
> >
> > Huh. Has anyone looked into how they do it?
>
> I'm not sure this is it, but it may be: it uses separate tables for
> decomposition and composition, pointed to from a trie?
>

I think I've seen a trie recommended somewhere, maybe on the official website.
That said, I was able to get the hash working for recomposition (split into
a separate patch; both patches now leave the frontend code alone), and I'm
pleased to say it's 50-75x faster than linear search in simple tests. I'd
be curious how it compares to ICU now. Perhaps Daniel Verite would be
interested in testing again? (CC'd)
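
To illustrate the general idea of replacing a linear scan with a hashed lookup, here is a minimal sketch in Python. It is not the code from the attached patches: the five-entry table, the names RECOMP_PAIRS, RECOMP_HASH, recompose_linear and recompose_hashed are all hypothetical, and a Python dict stands in for whatever hash function the patch actually uses.

```python
import unicodedata

# Hypothetical miniature recomposition table (not the real generated table):
# (starter, combining mark, composed code point).
RECOMP_PAIRS = [
    (0x0041, 0x030A, 0x00C5),  # A + ring above  -> U+00C5
    (0x0061, 0x0308, 0x00E4),  # a + diaeresis   -> U+00E4
    (0x006F, 0x0302, 0x00F4),  # o + circumflex  -> U+00F4
    (0x0065, 0x0302, 0x00EA),  # e + circumflex  -> U+00EA
    (0x00EA, 0x0301, 0x1EBF),  # U+00EA + acute  -> U+1EBF
]


def recompose_linear(start, code):
    """Scan every entry until the pair matches: O(n) per lookup."""
    for s, c, composed in RECOMP_PAIRS:
        if s == start and c == code:
            return composed
    return None


# Build the hashed table once, keyed on the (starter, mark) pair.
RECOMP_HASH = {(s, c): composed for s, c, composed in RECOMP_PAIRS}


def recompose_hashed(start, code):
    """Look the pair up directly: expected O(1) per lookup."""
    return RECOMP_HASH.get((start, code))


# Sanity check of the sample entries against Python's own NFC normalizer.
for s, c, composed in RECOMP_PAIRS:
    assert unicodedata.normalize("NFC", chr(s) + chr(c)) == chr(composed)
```

With only five entries the two functions perform about the same; the gap only shows up when the table is large and the lookup runs once per candidate pair in the input.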

select count(normalize(t, NFC)) from (
  select md5(i::text) as t
  from generate_series(1,100000) as i
) s;

master     patch
18800ms    257ms

select count(normalize(t, NFC)) from (
  select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D',
                i % 3 + 1) as t
  from generate_series(1,100000) as i
) s;

master     patch
13000ms    254ms

--
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment                                    Content-Type         Size
v2-0001-Speed-up-unicode-decomposition.patch  application/x-patch  117.9 KB
v2-0002-Speed-up-unicode-recomposition.patch  application/x-patch  54.0 KB
