Re: Unicode normalization SQL functions

From: John Naylor <john(dot)naylor(at)2ndquadrant(dot)com>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Verite <daniel(at)manitou-mail(dot)org>, Andreas Karlsson <andreas(at)proxel(dot)se>
Subject: Re: Unicode normalization SQL functions
Date: 2020-04-03 04:01:47
Message-ID: CACPNZCsptwBXYGqxT8RNsYVAc0dncJkCvtchUDg=6Jbrbuo8Vw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Apr 2, 2020 at 3:51 PM Peter Eisentraut
<peter(dot)eisentraut(at)2ndquadrant(dot)com> wrote:
>
> On 2020-03-26 18:41, John Naylor wrote:
> > We don't have a trie implementation in Postgres, but we do have a
> > perfect hash implementation. Doing that would bring the tables back to
> > 64 bits per entry, but would likely be noticeably faster than binary
> > search. Since v4 has left out the biggest tables entirely, I think
> > this might be worth a look for the smaller tables remaining.
> >
> > In the attached v5, when building the hash tables, we sort the code
> > points by NO/MAYBE, and store the index of the beginning of the NO
> > block:
>
> This is a valuable idea, but I fear it's a bit late now in this cycle.
> I have questions about some details. For example, you mention that you
> had to fiddle with the hash seed. How does that affect other users of
> PerfectHash?

They would still try the same multipliers they use now, so no effect on them.

> What happens when we update Unicode data and the hash
> doesn't work anymore?

The script would choose different multipliers and/or seeds
automatically. Only if you're unlucky would you have to fiddle with
the hash parameters again. The last-resort multipliers in the v2 patch
in the other thread [1] seem very effective and easily build both the
quick check D tables, which I tried for amusement's sake.

That said, we could reduce the chances of that happening this way:
After trying all the shift-and-add multipliers, we could add another
loop to try a bunch of numbers in a range. We'd need a quick check to
weed out multiples of small primes so the number has a decent chance
of being prime. To save time, just try a few seeds and move on to the
next number. Maybe emit a warning that it exhausted the shift-and-add
multipliers in case the developer wanted to intervene.

If I resubmit this, I would split the build up into two steps: have
the current manual script build the quick check array for later commit
into the tree, and build the hash function separately from that as a
Makefile distprep target. There's no reason to have the hash functions
checked in as I did in v5, like we don't check in the keyword hash
functions.

I would also consider splitting it into two patches:

1. Keep binary search but with a more abstract array representation
(think PG_RMGR). This patch would be large but mostly mechanical.
2. Switch to hash lookup. A smaller one for ease of review.

[1] https://www.postgresql.org/message-id/CACPNZCvMMj88Bsnk1k%3DRffW6gBw%2BFH7wcwCBfcKLDM%3DUEG2UWg%40mail.gmail.com

--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Dilip Kumar 2020-04-03 04:05:13 Re: WAL usage calculation patch
Previous Message Dilip Kumar 2020-04-03 03:47:54 Re: WAL usage calculation patch