Re: speed up unicode normalization quick check

From: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To: John Naylor <john(dot)naylor(at)2ndquadrant(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: speed up unicode normalization quick check
Date: 2020-09-19 17:46:08
Message-ID: B999AEF6-24A1-40E3-B548-7D28043EDCCE@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Sep 18, 2020, at 9:41 AM, John Naylor <john(dot)naylor(at)2ndquadrant(dot)com> wrote:
>
> Attached is version 4, which excludes the output file from pgindent,
> to match recent commit 74d4608f5. Since it won't be indented again, I
> also tweaked the generator script to match pgindent for the typedef,
> since we don't want to lose what pgindent has fixed already. This last
> part isn't new to v4, but I thought I'd highlight it anyway.
>
> --
> John Naylor https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> <v4-0001-Tweak-the-set-of-candidate-multipliers-for-genera.patch><v4-0002-Use-perfect-hashing-for-NFC-Unicode-normalization.patch><v4-0003-Use-perfect-hashing-for-NKFC-Unicode-normalizatio.patch>

0002 and 0003 look good to me. I like the way you cleaned up a bit with the unicode_norm_props struct, which makes the code a bit more tidy, and on my compiler under -O2 it does not generate any extra runtime dereferences, as the compiler can see through the struct just fine. My only concern would be if some other compilers might not see through the struct, resulting in a runtime performance cost? I wouldn't even ask, except that qc_hash_lookup is called in a fairly tight loop.

To clarify, the following changes to the generated code which remove the struct and corresponding dereferences (not intended as a patch submission) cause zero bytes of change in the compiled output for me on mac/clang, which is good, and generate inconsequential changes on linux/gcc, which is also good, but I wonder if that is true for all compilers. In your commit message for 0001 you mentioned testing on a multiplicity of compilers. Did you do that for 0002 and 0003 as well?

diff --git a/src/common/unicode_norm.c b/src/common/unicode_norm.c
index 1714837e64..976b96e332 100644
--- a/src/common/unicode_norm.c
+++ b/src/common/unicode_norm.c
@@ -476,8 +476,11 @@ qc_compare(const void *p1, const void *p2)
return (v1 - v2);
}

-static const pg_unicode_normprops *
-qc_hash_lookup(pg_wchar ch, const unicode_norm_info * norminfo)
+static inline const pg_unicode_normprops *
+qc_hash_lookup(pg_wchar ch,
+ const pg_unicode_normprops *normprops,
+ qc_hash_func hash,
+ int num_normprops)
{
int h;
uint32 hashkey;
@@ -487,21 +490,21 @@ qc_hash_lookup(pg_wchar ch, const unicode_norm_info * norminfo)
* in network order.
*/
hashkey = htonl(ch);
- h = norminfo->hash(&hashkey);
+ h = hash(&hashkey);

/* An out-of-range result implies no match */
- if (h < 0 || h >= norminfo->num_normprops)
+ if (h < 0 || h >= num_normprops)
return NULL;

/*
* Since it's a perfect hash, we need only match to the specific codepoint
* it identifies.
*/
- if (ch != norminfo->normprops[h].codepoint)
+ if (ch != normprops[h].codepoint)
return NULL;

/* Success! */
- return &norminfo->normprops[h];
+ return &normprops[h];
}

/*
@@ -518,7 +521,10 @@ qc_is_allowed(UnicodeNormalizationForm form, pg_wchar ch)
switch (form)
{
case UNICODE_NFC:
- found = qc_hash_lookup(ch, &UnicodeNormInfo_NFC_QC);
+ found = qc_hash_lookup(ch,
+ UnicodeNormProps_NFC_QC,
+ NFC_QC_hash_func,
+ NFC_QC_num_normprops);
break;
case UNICODE_NFKC:
found = bsearch(&key,
diff --git a/src/include/common/unicode_normprops_table.h b/src/include/common/unicode_normprops_table.h
index 5e1e382af5..38300cfa12 100644
--- a/src/include/common/unicode_normprops_table.h
+++ b/src/include/common/unicode_normprops_table.h
@@ -13,13 +13,6 @@ typedef struct
signed int quickcheck:4; /* really UnicodeNormalizationQC */
} pg_unicode_normprops;

-typedef struct
-{
- const pg_unicode_normprops *normprops;
- qc_hash_func hash;
- int num_normprops;
-} unicode_norm_info;
-
static const pg_unicode_normprops UnicodeNormProps_NFC_QC[] = {
{0x0300, UNICODE_NORM_QC_MAYBE},
{0x0301, UNICODE_NORM_QC_MAYBE},
@@ -1583,12 +1576,6 @@ NFC_QC_hash_func(const void *key)
return h[a % 2463] + h[b % 2463];
}

-static const unicode_norm_info UnicodeNormInfo_NFC_QC = {
- UnicodeNormProps_NFC_QC,
- NFC_QC_hash_func,
- 1231
-};
-
static const pg_unicode_normprops UnicodeNormProps_NFKC_QC[] = {
{0x00A0, UNICODE_NORM_QC_NO},
{0x00A8, UNICODE_NORM_QC_NO},

--
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Justin Pryzby 2020-09-19 17:55:59 Re: Probable documentation errors or improvements
Previous Message Tom Lane 2020-09-19 17:20:34 Re: Feature proposal for psql