Re: speed up unicode decomposition and recomposition

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, daniel(at)manitou-mail(dot)org
Subject: Re: speed up unicode decomposition and recomposition
Date: 2020-10-16 03:32:08
Message-ID: 20201016033208.GC1581@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Oct 15, 2020 at 01:59:38PM -0400, John Naylor wrote:
> I think I've seen a trie recommended somewhere, maybe the official website.
> That said, I was able to get the hash working for recomposition (split into
> a separate patch, and both of them now leave frontend alone), and I'm
> pleased to say it's 50-75x faster than linear search in simple tests. I'd
> be curious how it compares to ICU now. Perhaps Daniel Verite would be
> interested in testing again? (CC'd)

Yeah, that would be interesting to compare. Now the gains proposed by
this patch are already a good step forward, so I don't think that it
should be a blocker for a solution we have at hand as the numbers
speak by themselves here. So if something better gets proposed, we
could always change the decomposition and recomposition logic as
needed.

> select count(normalize(t, NFC)) from (
> select md5(i::text) as t from
> generate_series(1,100000) as i
> ) s;
>
> master patch
> 18800ms 257ms

My environment was showing HEAD as being a bit faster with 15s, while
the patch gets "only" down to 290~300ms (compiled with -O2, as I guess
you did). Nice.

+ # Then the second
+ return -1 if $a2 < $b2;
+ return 1 if $a2 > $b2;
Should say "second code point" here?

+ hashkey = pg_hton64(((uint64) start << 32) | (uint64) code);
+ h = recompinfo.hash(&hashkey);
This choice should be documented, and most likely we should have
comments on the perl and C sides to keep track of the relationship
between the two.

The binary sizes of libpgcommon_shlib.a and libpgcommon.a change
because Decomp_hash_func() gets included, impacting libpq.
Structurally, wouldn't it be better to move this part into its own,
backend-only, header? It could be possible to paint the difference
with some ifdef FRONTEND of course, or just keep things as they are
because this can be useful for some out-of-core frontend tool? But if
we keep that as a separate header then any C part can decide to
include it or not, so frontend tools could also make this choice.
Note that we don't include unicode_normprops_table.h for frontends in
unicode_norm.c, but that's the case of unicode_norm_table.h.
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Hou, Zhijie 2020-10-16 03:42:34 RE: Use list_delete_xxxcell O(1) instead of list_delete_ptr O(N) in some places
Previous Message Andres Freund 2020-10-16 03:27:02 Re: gs_group_1 crashing on 13beta2/s390x