Re: Password identifiers, protocol aging and SCRAM protocol

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, David Steele <david(at)pgmasters(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, David Fetter <david(at)fetter(dot)org>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, Julian Markwort <julian(dot)markwort(at)uni-muenster(dot)de>, Stephen Frost <sfrost(at)snowman(dot)net>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Valery Popov <v(dot)popov(at)postgrespro(dot)ru>
Subject: Re: Password identifiers, protocol aging and SCRAM protocol
Date: 2017-02-03 12:52:52
Message-ID: a55cffda-8db6-8fd2-29aa-bfe5c48dad78@iki.fi
Lists: pgsql-hackers

On 12/20/2016 03:47 AM, Michael Paquier wrote:
> The first thing is to be able to understand in the SCRAM code whether a
> string is UTF-8 or not, and this code is in src/common/. pg_wchar.c
> offers a set of routines exactly for this purpose, but it is built with
> libpq and is not available to src/common/. So instead of moving the
> whole file, I'd like to create a new file, src/common/utf8.c, which
> includes pg_utf_mblen() and pg_utf8_islegal().

Sounds reasonable. They're short functions, might also be ok to just
copy-paste them to scram-common.c.

> On top of that I think that having a routine able to check a full
> string would be useful for many users, as pg_utf8_islegal() can only
> check one multibyte character at a time. If the password string is
> found to be in UTF-8 format, SASLprepare is applied. If not, the string
> is copied as-is, with perhaps unexpected effects for the client. But
> he's in trouble already if the client is not using UTF-8.

Yeah.
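
A full-string check could be little more than a loop around those two
functions. Here's a rough sketch, assuming pg_utf_mblen() and
pg_utf8_islegal() end up available in src/common/ (the wrapper's name is
made up):

#include <stdbool.h>

/* assumed to be provided by the moved/copied pg_wchar.c routines */
extern int	pg_utf_mblen(const unsigned char *s);
extern bool pg_utf8_islegal(const unsigned char *source, int length);

/* Return true if the 'len' bytes at 'str' form a valid UTF-8 string. */
static bool
pg_utf8_string_islegal(const unsigned char *str, int len)
{
	while (len > 0)
	{
		int			mblen = pg_utf_mblen(str);

		/* reject sequences that are truncated or otherwise invalid */
		if (mblen > len || !pg_utf8_islegal(str, mblen))
			return false;

		str += mblen;
		len -= mblen;
	}
	return true;
}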

> The second thing is the normalization itself. Per RFC 4013, NFKC needs
> to be applied to the string. The operation is described completely in
> [1]; it consists of 1) a compatibility decomposition of the bytes of
> the string, followed by 2) a canonical composition.
>
> About 1). The compatibility decomposition is defined in [2], "by
> recursively applying the canonical and compatibility mappings, then
> applying the canonical reordering algorithm". The canonical and
> compatibility mappings are data available in UnicodeData.txt, the
> 6th column of the set defined in [3] to be precise. The meaning of the
> decomposition mappings is defined in [2] as well. The canonical
> decomposition basically means looking up a given character and
> replacing it with the sequence of characters given by its mapping.
> The compatibility mapping should be applied as well, but [5], a Perl
> tool called charlint.pl that does this normalization work, does not
> care about this phase... Do we?

Not sure. We need to do whatever the "right thing" is, according to the
RFC. I would assume that the spec is not ambiguous about this, but I
haven't looked into the details. If it's ambiguous, then I think we need
to look at some popular implementations to see what they do.

> About 2)... Once the decomposition has been applied, those bytes need
> to be recomposed using the Canonical_Combining_Class field of
> UnicodeData.txt in [3], which is the 3rd column of the set. Its values
> are defined in [4]. Another interesting thing: charlint.pl [5] does
> not care about this phase either. I am wondering if we should just
> drop this part as well...
>
> Once 1) and 2) are done, NFKC is complete, and so is SASLprepare.

Ok.
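
To make sure we're talking about the same sequence of steps, I imagine
the driver ends up looking roughly like this (every helper name here is
made up; this is only meant to illustrate the order of operations):

/* hypothetical outline of SASLprep on an already-validated UTF-8 string */
static char *
pg_saslprep(const char *password)
{
	pg_wchar   *chars = utf8_to_codepoints(password);	/* made-up helper */

	/* 1. recursive decomposition using the UnicodeData.txt mappings */
	chars = unicode_decompose(chars);
	/* canonical reordering of combining marks by combining class */
	unicode_reorder(chars);
	/* 2. canonical composition of the reordered code points */
	chars = unicode_compose(chars);

	return codepoints_to_utf8(chars);	/* back to a UTF-8 string */
}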

> So what we need from the Postgres side is a mapping table with the
> following fields:
> 1) Hex sequence of the UTF-8 character.
> 2) Its canonical combining class.
> 3) The kind of decomposition mapping, if defined.
> 4) The decomposition mapping, in hexadecimal format.
> Based on what I looked at, either Perl or Python could be used to
> process UnicodeData.txt and to generate a header file that would be
> included in the tree. There are 30k entries in UnicodeData.txt, 5k of
> them have a mapping, so that will result in sizable tables. One thing
> to improve performance would be to store the length of the table in a
> static variable, order the entries by their hexadecimal keys and do a
> binary search to find an entry. We could as well use fancier things
> like a set of tables forming a radix tree decomposed by bytes. Either
> way we end up doing just one table lookup per character.

Ok. I'm not too worried about the performance of this. It's only used
for passwords, which are not that long, and it's only done when
connecting. I'm more worried about the disk/memory usage. How small can
we pack the tables? 10kB? 100kB? Even a few MB would probably not be too
bad in practice, but I'd hate to bloat up libpq just for this.
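
For what it's worth, the layout I had in mind when thinking about size is
one entry per code point that has a non-zero combining class or a
decomposition, with the decomposition code points packed into a single
shared array. A sketch (all names and exact field widths are made up):

#include <stdint.h>
#include <stddef.h>

typedef struct
{
	uint32_t	codepoint;		/* Unicode code point */
	uint8_t		comb_class;		/* canonical combining class */
	uint8_t		decomp_flags;	/* canonical vs. compatibility mapping */
	uint8_t		decomp_size;	/* number of code points in the mapping */
	uint16_t	decomp_idx;		/* offset into a shared code point array */
} pg_unicode_entry;

/* Entries are sorted by code point, so a lookup is a binary search. */
static const pg_unicode_entry *
unicode_lookup(uint32_t cp, const pg_unicode_entry *table, int nentries)
{
	int			lo = 0;
	int			hi = nentries - 1;

	while (lo <= hi)
	{
		int			mid = (lo + hi) / 2;

		if (table[mid].codepoint == cp)
			return &table[mid];
		if (table[mid].codepoint < cp)
			lo = mid + 1;
		else
			hi = mid - 1;
	}
	return NULL;				/* no entry: the character maps to itself */
}

Packing the variable-length mappings into one shared array avoids
per-entry pointers, which should keep the whole thing well below the
megabyte range.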

> In conclusion, at this point I am looking for feedback regarding the
> following items:
> 1) Where to put the UTF8 check routines and what to move.

Covered that above.

> 2) How to generate the mapping table using UnicodeData.txt. I'd think
> that using Perl would be better.

Agreed, it needs to be in Perl. That's what we require to be present
when building PostgreSQL, and it's what we use for generating other
tables and functions.
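
Whatever the exact format, I'd expect the generated header to look
something along these lines (a purely hypothetical excerpt, reusing the
entry layout sketched above; the file name and flag constants are made
up, the values are taken from UnicodeData.txt):

/* src/include/common/unicode_norm_table.h -- generated, do not edit */

/* decomposition code points, referenced by decomp_idx/decomp_size */
static const uint32_t unicode_decomp_chars[] = {
	0x0041, 0x0300,		/* for U+00C0 LATIN CAPITAL LETTER A WITH GRAVE */
	0x0066, 0x0069,		/* for U+FB01 LATIN SMALL LIGATURE FI */
	/* ... */
};

/* one entry per code point with a combining class or a decomposition */
static const pg_unicode_entry unicode_table[] = {
	{0x00C0, 0, DECOMP_CANONICAL, 2, 0},
	{0x0301, 230, 0, 0, 0},		/* COMBINING ACUTE ACCENT */
	{0xFB01, 0, DECOMP_COMPAT, 2, 2},
	/* ... */
};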

> 3) The shape of the mapping table, which depends on how many
> operations we want to support in the normalization of the strings.
> The decisions on those items will drive the implementation in one
> direction or another.

Let's aim for small disk/memory footprint.

- Heikki

> [1]: http://www.unicode.org/reports/tr15/#Description_Norm
> [2]: http://www.unicode.org/Public/5.1.0/ucd/UCD.html#Character_Decomposition_Mappings
> [3]: http://www.unicode.org/Public/5.1.0/ucd/UCD.html#UnicodeData.txt
> [4]: http://www.unicode.org/Public/5.1.0/ucd/UCD.html#Canonical_Combining_Class_Values
> [5]: https://www.w3.org/International/charlint/
