From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Implementation of SASLprep for SCRAM-SHA-256
Date: 2017-04-05 04:23:12
Message-ID: CAB7nPqQqKih8rY6jZFgwtTh6dmbj7B9d0b48=60xPtu+5WqbVw@mail.gmail.com
Lists: pgsql-hackers
On Wed, Apr 5, 2017 at 7:05 AM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> I will continue tomorrow, but I wanted to report on what I've done so far.
> Attached is a new patch version, quite heavily modified. Notable changes so
> far:
Great, thanks!
> * Use Unicode codepoints, rather than UTF-8 bytes packed in a 32-bit ints.
> IMHO this makes the tables easier to read (to a human), and they are also
> packed slightly more tightly (see next two points), as you can fit more
> codepoints in a 16-bit integer.
Using codepoints directly is not very consistent with the existing
backend, but for the sake of tighter packing, OK.
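For illustration, here is the difference between the two representations for U+00E9 (LATIN SMALL LETTER E WITH ACUTE): the packed-bytes form keeps the raw UTF-8 bytes in a uint32, while the codepoint form is the smaller Unicode scalar value, which often fits in 16 bits. The helper names here are illustrative, not from the patch:

```c
#include <stdint.h>

/* Pack a UTF-8 byte sequence into a single uint32, the representation the
 * earlier patch versions used as table keys (illustrative helper). */
static uint32_t
utf8_bytes_packed(const unsigned char *s, int len)
{
	uint32_t	packed = 0;
	int			i;

	for (i = 0; i < len; i++)
		packed = (packed << 8) | s[i];
	return packed;
}

/* Decode the same sequence (2-byte case only, for illustration) to a
 * Unicode codepoint: the value is smaller and often fits in a uint16. */
static uint32_t
utf8_to_codepoint_2byte(const unsigned char *s)
{
	return ((uint32_t) (s[0] & 0x1F) << 6) | (s[1] & 0x3F);
}
```

For U+00E9 the packed form is 0xC3A9 (does not fit in 15 bits of payload), while the codepoint is 0x00E9.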
> * Reorganize the decomposition table. Instead of a separate binary-searched
> table for each "size", have one table that's ordered by codepoint, which
> contains a length and offset into another array, where all the decomposed
> sequences are. This makes the recomposition step somewhat slower, as it now
> has to grovel through an array that contains all decompositions, rather than
> just the array that contains decompositions of two characters. But this
> isn't performance-critical, readability and tight packing of the tables
> matter more.
Okay, no objection to that.
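As a sketch of what that reorganized layout looks like (struct names, fields, and sample data are illustrative, not taken from the patch): one table ordered by codepoint, each entry carrying a length and an offset into a flat array holding all decomposed sequences, looked up by binary search.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct
{
	uint32_t	codepoint;		/* character to decompose */
	uint8_t		dec_size;		/* number of codepoints in decomposition */
	uint16_t	dec_index;		/* offset into decomp_chars[] */
} DecompEntry;

/* Flat array holding all decomposed sequences back to back. */
static const uint32_t decomp_chars[] = {
	/* U+00C5 -> U+0041 U+030A (A + COMBINING RING ABOVE) */
	0x0041, 0x030A,
	/* U+1E0A -> U+0044 U+0307 (D + COMBINING DOT ABOVE) */
	0x0044, 0x0307
};

/* Main table, ordered by codepoint so it can be binary-searched. */
static const DecompEntry decomp_table[] = {
	{0x00C5, 2, 0},
	{0x1E0A, 2, 2}
};

static const DecompEntry *
get_decomp_entry(uint32_t cp)
{
	int			lo = 0;
	int			hi = (int) (sizeof(decomp_table) / sizeof(decomp_table[0])) - 1;

	while (lo <= hi)
	{
		int			mid = (lo + hi) / 2;

		if (decomp_table[mid].codepoint == cp)
			return &decomp_table[mid];
		if (decomp_table[mid].codepoint < cp)
			lo = mid + 1;
		else
			hi = mid - 1;
	}
	return NULL;
}
```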
> * In the lookup table, store singleton decompositions that decompose to a
> single character, and the character fits in a uint16, directly in the main
> lookup table, instead of storing an index into the other table. This makes
> the tables somewhat smaller, as there are a lot of such decompositions.
Indeed.
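For readers following along, the singleton optimization could be sketched like this (flag name and struct layout illustrative only): when a character decomposes to a single codepoint that fits in a uint16, that codepoint is stored directly in the main table entry, and a flag distinguishes it from an offset into the shared sequence array.

```c
#include <stdint.h>

#define DECOMP_INLINE	0x01	/* illustrative flag name */

typedef struct
{
	uint32_t	codepoint;		/* character to decompose */
	uint8_t		flags;			/* DECOMP_INLINE set for singletons */
	uint16_t	dec_value;		/* the codepoint itself if inline,
								 * otherwise an offset into the shared
								 * decomposition array */
} SingletonEntry;

/* U+212A KELVIN SIGN decomposes to the single character U+004B 'K'. */
static const SingletonEntry kelvin = {0x212A, DECOMP_INLINE, 0x004B};

static uint32_t
get_singleton(const SingletonEntry *e)
{
	/* valid only when DECOMP_INLINE is set */
	return (uint32_t) e->dec_value;
}
```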
> * Use uint32 instead of pg_wchar to represent unicode codepoints. pg_wchar
> suggests something that holds multi-byte characters in the OS locale, used
> by functions like wcscmp, so avoid that. I added a "typedef uint32
> Codepoint" alias, though, to make it more clear which variables hold Unicode
> codepoints rather than e.g. indexes into arrays.
>
> * Unicode consortium publishes a comprehensive normalization test suite, as
> a text file, NormalizationTest.txt. I wrote a small perl and C program to
> run all the tests from that test suite, with the normalization function.
> That uncovered a number of bugs in the recomposition algorithm, which I then
> fixed:
I had been looking for something like that for some time, thanks! It
is available here:
http://www.unicode.org/Public/8.0.0/ucd/NormalizationTest.txt
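For reference, each line of NormalizationTest.txt has five semicolon-separated columns (source, NFC, NFD, NFKC, NFKD), each column being a space-separated list of hex codepoints. A minimal parser for one such column might look like this (illustrative, not the actual test driver from the patch):

```c
#include <stdint.h>
#include <stdlib.h>

/* Parse one column of a NormalizationTest.txt line: a space-separated
 * list of hex codepoints.  Returns the number of codepoints written. */
static int
parse_codepoint_column(const char *col, uint32_t *out, int max)
{
	int			n = 0;
	char	   *end;

	while (*col && n < max)
	{
		uint32_t	cp = (uint32_t) strtoul(col, &end, 16);

		if (end == col)
			break;				/* no more hex digits */
		out[n++] = cp;
		col = end;
		while (*col == ' ')
			col++;
	}
	return n;
}
```

The test file's invariants (e.g. NFKC(source) must equal column 4) can then be checked against the normalization function.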
> * decompositions that map to ascii characters cannot be excluded.
> * non-canonical and non-starter decompositions need to be excluded. They
> are now flagged in the lookup table, so that we only use them during the
> decomposition phase, not when recomposing.
Okay on that.
> TODOs / Discussion points:
>
> * I'm not sure I like the way the code is organized, I feel that we're
> littering src/common with these. Perhaps we should add a
> src/common/unicode_normalization directory for all this.
pg_utf8_islegal() and pg_utf_mblen() should be moved into their own
file as well, I think, and wchar.c can use that.
> * The list of characters excluded from recomposition is currently hard-coded
> in utf_norm_generate.pl. However, that list is available in machine-readable
> format, in file CompositionExclusions.txt. Since we're reading most of the
> data from UnicodeData.txt, would be good to read the exclusion table from a
> file, too.
Ouch. Those are listed here:
http://www.unicode.org/reports/tr41/tr41-19.html#Exclusions
It definitely makes more sense to read them from a file.
> * SASLPrep specifies normalization form KC, but it also specifies that some
> characters are mapped to space or nothing. Should do those mappings, too.
Ah, right. Those ones are here:
https://tools.ietf.org/html/rfc3454#appendix-B.1
The conversion tables need some extra tweaking...
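As a sketch of that extra mapping step (the short lists below are illustrative excerpts, not the full RFC 3454 tables): table B.1 maps characters such as U+00AD SOFT HYPHEN and U+200B ZERO WIDTH SPACE to nothing, and table C.1.2 maps non-ASCII space characters to U+0020.

```c
#include <stdint.h>

/* Illustrative excerpt of RFC 3454 table B.1 (mapped to nothing). */
static int
is_mapped_to_nothing(uint32_t cp)
{
	return cp == 0x00AD || cp == 0x200B || cp == 0xFEFF;
}

/* Illustrative excerpt of RFC 3454 table C.1.2 (non-ASCII spaces). */
static int
is_mapped_to_space(uint32_t cp)
{
	return cp == 0x00A0 || cp == 0x2007 || cp == 0x3000;
}

/* Apply the mappings in place; returns the new length. */
static int
saslprep_map(uint32_t *input, int len)
{
	int			i,
				out = 0;

	for (i = 0; i < len; i++)
	{
		if (is_mapped_to_nothing(input[i]))
			continue;			/* drop the character entirely */
		if (is_mapped_to_space(input[i]))
			input[out++] = 0x0020;	/* replace with ASCII space */
		else
			input[out++] = input[i];
	}
	return out;
}
```

In the real implementation these membership checks would be table lookups generated from the RFC's appendices rather than hard-coded conditions.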
+ else if ((*utf & 0xf0) == 0xe0)
+ {
+ if (utf[1] == 0 || utf[2] == 0)
+ goto invalid;
+ cp = (*utf++) & 0x0F;
+ cp = (cp << 6) | ((*utf++) & 0x3F);
+ cp = (cp << 6) | ((*utf++) & 0x3F);
+ }
+ else if ((*utf & 0xf8) == 0xf0)
+ {
+ if (utf[1] == 0 || utf[2] == 0 || utf[3] == 0)
+ goto invalid;
+ cp = (*utf++) & 0x07;
+ cp = (cp << 6) | ((*utf++) & 0x3F);
+ cp = (cp << 6) | ((*utf++) & 0x3F);
+ cp = (cp << 6) | ((*utf++) & 0x3F);
+ }
I find something like pg_utf2wchar_with_len() more readable, but well...
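For completeness, here is a self-contained sketch of the decoding logic in the hunk above: decode one NUL-terminated UTF-8 sequence to a codepoint, returning the number of bytes consumed or -1 for a truncated sequence. As in the quoted patch, validation of the continuation bytes themselves is left to a separate legality check:

```c
#include <stdint.h>

static int
utf8_decode(const unsigned char *utf, uint32_t *cp)
{
	if ((*utf & 0x80) == 0)
	{
		*cp = *utf;				/* 1-byte (ASCII) */
		return 1;
	}
	else if ((*utf & 0xe0) == 0xc0)
	{
		if (utf[1] == 0)
			return -1;			/* truncated */
		*cp = ((uint32_t) (utf[0] & 0x1F) << 6) | (utf[1] & 0x3F);
		return 2;
	}
	else if ((*utf & 0xf0) == 0xe0)
	{
		if (utf[1] == 0 || utf[2] == 0)
			return -1;
		*cp = ((uint32_t) (utf[0] & 0x0F) << 12) |
			((uint32_t) (utf[1] & 0x3F) << 6) | (utf[2] & 0x3F);
		return 3;
	}
	else if ((*utf & 0xf8) == 0xf0)
	{
		if (utf[1] == 0 || utf[2] == 0 || utf[3] == 0)
			return -1;
		*cp = ((uint32_t) (utf[0] & 0x07) << 18) |
			((uint32_t) (utf[1] & 0x3F) << 12) |
			((uint32_t) (utf[2] & 0x3F) << 6) | (utf[3] & 0x3F);
		return 4;
	}
	return -1;					/* invalid leading byte */
}
```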
Some typos:
s/fore/for/
s/reprsented/represented/
You seem to be fully on top of it... I can help out with any of those items as needed.
--
Michael