Re: Implementation of SASLprep for SCRAM-SHA-256

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Implementation of SASLprep for SCRAM-SHA-256
Date: 2017-04-04 22:05:12
Message-ID: 4e9eb400-0827-dbd6-8e48-661b31f2be5e@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 04/04/2017 01:52 PM, Heikki Linnakangas wrote:
> On 03/31/2017 10:10 AM, Michael Paquier wrote:
>> On Wed, Mar 8, 2017 at 10:39 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> On Tue, Mar 7, 2017 at 10:01 PM, Michael Paquier
>>> <michael(dot)paquier(at)gmail(dot)com> wrote:
>>> I kinda hope Heikki is going to step up to the plate here, because I
>>> think he understands most of it already.
>>
>> The second person who knows a good deal about NFKC is Ishii-san.
>>
>> Attached is a rebased patch.
>
> Thanks, I'm looking at this now.

I will continue tomorrow, but I wanted to report on what I've done so
far. Attached is a new patch version, quite heavily modified. Notable
changes so far:

* Use Unicode codepoints, rather than UTF-8 bytes packed in a 32-bit
ints. IMHO this makes the tables easier to read (to a human), and they
are also packed slightly more tightly (see next two points), as you can
fit more codepoints in a 16-bit integer.

* Reorganize the decomposition table. Instead of a separate
binary-searched table for each "size", have one table that's ordered by
codepoint, which contains a length and offset into another array, where
all the decomposed sequences are. This makes the recomposition step
somewhat slower, as it now has to grovel through an array that contains
all decompositions, rather than just the array that contains
decompositions of two characters. But this isn't performance-critical,
readability and tight packing of the tables matter more.

* In the lookup table, store singleton decompositions that decompose to
a single character, and the character fits in a uint16, directly in the
main lookup table, instead of storing an index into the other table.
This makes the tables somewhat smaller, as there are a lot of such
decompositions.

* Use uint32 instead of pg_wchar to represent unicode codepoints.
pg_wchar suggests something that holds multi-byte characters in the OS
locale, used by functions like wcscmp, so avoid that. I added a "typedef
uint32 Codepoint" alias, though, to make it more clear which variables
hold Unicode codepoints rather than e.g. indexes into arrays.

* Unicode consortium publishes a comprehensive normalization test suite,
as a text file, NormalizationTest.txt. I wrote a small perl and C
program to run all the tests from that test suite, with the
normalization function. That uncovered a number of bugs in the
recomposition algorithm, which I then fixed:

* decompositions that map to ascii characters cannot be excluded.
* non-canonical and non-starter decompositions need to be excluded.
They are now flagged in the lookup table, so that we only use them
during the decomposition phase, not when recomposing.

TODOs / Discussion points:

* I'm not sure I like the way the code is organized, I feel that we're
littering src/common with these. Perhaps we should add a
src/common/unicode_normalization directory for all this.

* The list of characters excluded from recomposition is currently
hard-coded in utf_norm_generate.pl. However, that list is available in
machine-readable format, in file CompositionExclusions.txt. Since we're
reading most of the data from UnicodeData.txt, would be good to read the
exclusion table from a file, too.

* SASLPrep specifies normalization form KC, but it also specifies that
some characters are mapped to space or nothing. Should do those
mappings, too.

- Heikki

Attachment Content-Type Size
implement-SASLprep-2.patch.gz application/gzip 60.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2017-04-04 22:06:42 Re: increasing the default WAL segment size
Previous Message Andres Freund 2017-04-04 21:46:28 Re: bug in SlabAlloc / VALGRIND_MAKE_MEM_DEFINED