Re: Implementation of SASLprep for SCRAM-SHA-256

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Implementation of SASLprep for SCRAM-SHA-256
Date: 2017-04-05 04:23:12
Message-ID: CAB7nPqQqKih8rY6jZFgwtTh6dmbj7B9d0b48=60xPtu+5WqbVw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Apr 5, 2017 at 7:05 AM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> I will continue tomorrow, but I wanted to report on what I've done so far.
> Attached is a new patch version, quite heavily modified. Notable changes so
> far:

Great, thanks!

> * Use Unicode codepoints, rather than UTF-8 bytes packed into 32-bit ints.
> IMHO this makes the tables easier to read (to a human), and they are also
> packed slightly more tightly (see next two points), as you can fit more
> codepoints in a 16-bit integer.

Using codepoints directly is not quite consistent with the existing
backend, but for the sake of tighter packing, OK.
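
For illustration, the two representations of a single character would
look something like this (a sketch with made-up names, not the patch's
actual tables):

#include <stdint.h>

/* U+00E9 as its UTF-8 bytes 0xC3 0xA9 packed into a 32-bit integer,
 * which is what the old tables stored: */
static const uint32_t packed_utf8 = 0x0000C3A9;

/* The same character as a raw codepoint; anything below U+10000 fits
 * in 16 bits, so such entries can shrink to a uint16: */
static const uint16_t codepoint = 0x00E9;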

> * Reorganize the decomposition table. Instead of a separate binary-searched
> table for each "size", have one table that's ordered by codepoint, which
> contains a length and offset into another array, where all the decomposed
> sequences are. This makes the recomposition step somewhat slower, as it now
> has to grovel through an array that contains all decompositions, rather than
> just the array that contains decompositions of two characters. But this
> isn't performance-critical, readability and tight packing of the tables
> matter more.

Okay, no objection to that.
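
For readers following along, a minimal sketch of that layout (field and
table names are mine, not necessarily the patch's):

#include <stddef.h>
#include <stdint.h>

typedef struct
{
    uint32_t    codepoint;  /* character to decompose; table sorted by this */
    uint8_t     dec_size;   /* number of codepoints in the decomposition */
    uint16_t    dec_index;  /* offset into decomp_sequences[] */
} DecompEntry;

/* All decomposed sequences concatenated, generated from UnicodeData.txt */
static const uint32_t decomp_sequences[] = { /* ... */ 0 };

/* One binary search over the single table ordered by codepoint */
static const DecompEntry *
find_decomposition(uint32_t cp, const DecompEntry *table, int nentries)
{
    int         low = 0;
    int         high = nentries - 1;

    while (low <= high)
    {
        int         mid = (low + high) / 2;

        if (table[mid].codepoint == cp)
            return &table[mid];
        else if (table[mid].codepoint < cp)
            low = mid + 1;
        else
            high = mid - 1;
    }
    return NULL;
}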

> * In the lookup table, store singleton decompositions that decompose to a
> single character, and the character fits in a uint16, directly in the main
> lookup table, instead of storing an index into the other table. This makes
> the tables somewhat smaller, as there are a lot of such decompositions.

Indeed.
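
Continuing the hypothetical layout above, a spare flag bit in the size
field is one way to encode that:

#include <stdint.h>

#define DECOMP_INLINE   0x80    /* dec_index holds the codepoint itself */

typedef struct
{
    uint32_t    codepoint;
    uint8_t     dec_size_flags; /* size in low bits, flags in high bits */
    uint16_t    dec_index;  /* offset, or the decomposed codepoint itself */
} DecompEntry;

static uint32_t
singleton_decomposition(const DecompEntry *entry)
{
    /* only valid when (entry->dec_size_flags & DECOMP_INLINE) is set */
    return entry->dec_index;
}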

> * Use uint32 instead of pg_wchar to represent unicode codepoints. pg_wchar
> suggests something that holds multi-byte characters in the OS locale, used
> by functions like wcscmp, so avoid that. I added a "typedef uint32
> Codepoint" alias, though, to make it more clear which variables hold Unicode
> codepoints rather than e.g. indexes into arrays.
>
> * The Unicode consortium publishes a comprehensive normalization test suite, as
> a text file, NormalizationTest.txt. I wrote a small perl and C program to
> run all the tests from that test suite, with the normalization function.
> That uncovered a number of bugs in the recomposition algorithm, which I then
> fixed:

I had been looking for something like that for some time, thanks! It's
here, actually:
http://www.unicode.org/Public/8.0.0/ucd/NormalizationTest.txt
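
For anyone else who wants to run it: each test line has five
';'-separated columns of space-separated hex codepoints ('#' comment and
'@Part' header lines need to be skipped), and for NFKC the invariant is
that column 4 equals NFKC of every column. The parsing side could look
roughly like this (the driver around it is left out):

#include <stdint.h>
#include <stdlib.h>

/* Parse one column of space-separated hex codepoints, e.g.
 * "0044 0307 0323"; returns the number of codepoints written. */
static int
parse_codepoints(const char *field, uint32_t *cp, int max)
{
    int     n = 0;
    char   *end;

    while (*field != '\0' && *field != ';' && n < max)
    {
        cp[n++] = (uint32_t) strtoul(field, &end, 16);
        if (end == field)
            break;              /* no more hex digits */
        field = end;
        while (*field == ' ')
            field++;
    }
    return n;
}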

> * decompositions that map to ascii characters cannot be excluded.
> * non-canonical and non-starter decompositions need to be excluded. They
> are now flagged in the lookup table, so that we only use them during the
> decomposition phase, not when recomposing.

Okay on that.
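
In other words, something like a flag checked only at recomposition time
(flag name made up here):

#include <stdbool.h>
#include <stdint.h>

#define DECOMP_NO_COMPOSE   0x40    /* non-canonical or non-starter */

/* Flagged entries are still followed when decomposing, but skipped when
 * searching for a recomposition pair. */
static bool
usable_for_recomposition(uint8_t dec_size_flags)
{
    return (dec_size_flags & DECOMP_NO_COMPOSE) == 0;
}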

> TODOs / Discussion points:
>
> * I'm not sure I like the way the code is organized, I feel that we're
> littering src/common with these. Perhaps we should add a
> src/common/unicode_normalization directory for all this.

pg_utf8_islegal() and pg_utf_mblen() should also be moved into their
own file, I think, so that wchar.c can use them.

> * The list of characters excluded from recomposition is currently hard-coded
> in utf_norm_generate.pl. However, that list is available in machine-readable
> format, in file CompositionExclusions.txt. Since we're reading most of the
> data from UnicodeData.txt, would be good to read the exclusion table from a
> file, too.

Ouch. Those are present here:
http://www.unicode.org/reports/tr41/tr41-19.html#Exclusions
It definitely makes more sense to read them from a file.
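
The generator is perl, but the parsing itself is trivial; data lines in
CompositionExclusions.txt start with a hex codepoint and full-line
comments with '#', so in C it would be something like:

#include <ctype.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Read the excluded codepoints; returns how many were found, or -1 */
static int
read_exclusions(const char *path, uint32_t *excl, int max)
{
    FILE       *f = fopen(path, "r");
    char        line[256];
    int         n = 0;

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof(line), f) != NULL && n < max)
    {
        if (!isxdigit((unsigned char) line[0]))
            continue;           /* comment or blank line */
        excl[n++] = (uint32_t) strtoul(line, NULL, 16);
    }
    fclose(f);
    return n;
}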

> * SASLPrep specifies normalization form KC, but it also specifies that some
> characters are mapped to space or nothing. Should do those mappings, too.

Ah, right. Those are listed here:
https://tools.ietf.org/html/rfc3454#appendix-B.1
The conversion tables will need some extra tweaking for that...
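
Something along these lines, with the tables generated from those
appendices (contents abbreviated here, names made up):

#include <stdbool.h>
#include <stdint.h>

/* RFC 3454 appendix B.1: characters commonly mapped to nothing */
static bool
is_mapped_to_nothing(uint32_t cp)
{
    /* e.g. U+00AD SOFT HYPHEN, U+200B ZERO WIDTH SPACE, ... */
    return cp == 0x00AD || cp == 0x200B;    /* abbreviated */
}

/* RFC 3454 appendix C.1.2: non-ASCII space characters */
static bool
is_non_ascii_space(uint32_t cp)
{
    /* e.g. U+00A0 NO-BREAK SPACE, U+3000 IDEOGRAPHIC SPACE, ... */
    return cp == 0x00A0 || cp == 0x3000;    /* abbreviated */
}

/* Apply the SASLprep mapping step in place; returns the new length. */
static int
saslprep_map(uint32_t *input, int len)
{
    int     out = 0;
    int     i;

    for (i = 0; i < len; i++)
    {
        if (is_mapped_to_nothing(input[i]))
            continue;           /* B.1: drop the character entirely */
        input[out++] = is_non_ascii_space(input[i]) ? 0x0020 : input[i];
    }
    return out;
}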

+	else if ((*utf & 0xf0) == 0xe0)
+	{
+		if (utf[1] == 0 || utf[2] == 0)
+			goto invalid;
+		cp = (*utf++) & 0x0F;
+		cp = (cp << 6) | ((*utf++) & 0x3F);
+		cp = (cp << 6) | ((*utf++) & 0x3F);
+	}
+	else if ((*utf & 0xf8) == 0xf0)
+	{
+		if (utf[1] == 0 || utf[2] == 0 || utf[3] == 0)
+			goto invalid;
+		cp = (*utf++) & 0x07;
+		cp = (cp << 6) | ((*utf++) & 0x3F);
+		cp = (cp << 6) | ((*utf++) & 0x3F);
+		cp = (cp << 6) | ((*utf++) & 0x3F);
+	}
I find something like pg_utf2wchar_with_len() more readable, but well...
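
For comparison, the whole conversion written as one self-contained
helper in that style would be something like this (a sketch only, with
the error cases reduced to a sentinel return; not the patch's code):

#include <stdint.h>

/* Decode one UTF-8 sequence from a NUL-terminated string into *cp;
 * returns the number of bytes consumed, or -1 on invalid input. */
static int
utf8_to_codepoint(const unsigned char *utf, uint32_t *cp)
{
    if (*utf < 0x80)            /* 1-byte (ASCII) */
    {
        *cp = *utf;
        return 1;
    }
    else if ((*utf & 0xe0) == 0xc0) /* 2-byte sequence */
    {
        if (utf[1] == 0)
            return -1;
        *cp = ((uint32_t) (utf[0] & 0x1F) << 6) | (utf[1] & 0x3F);
        return 2;
    }
    else if ((*utf & 0xf0) == 0xe0) /* 3-byte sequence */
    {
        if (utf[1] == 0 || utf[2] == 0)
            return -1;
        *cp = ((uint32_t) (utf[0] & 0x0F) << 12) |
            ((utf[1] & 0x3F) << 6) | (utf[2] & 0x3F);
        return 3;
    }
    else if ((*utf & 0xf8) == 0xf0) /* 4-byte sequence */
    {
        if (utf[1] == 0 || utf[2] == 0 || utf[3] == 0)
            return -1;
        *cp = ((uint32_t) (utf[0] & 0x07) << 18) | ((utf[1] & 0x3F) << 12) |
            ((utf[2] & 0x3F) << 6) | (utf[3] & 0x3F);
        return 4;
    }
    return -1;                  /* invalid leading byte */
}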

Some typos:
s/fore/for/
s/reprsented/represented/

You seem to be fully on top of this... I can help out with any of these items as needed.
--
Michael
