Re: Password identifiers, protocol aging and SCRAM protocol

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, David Steele <david(at)pgmasters(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, David Fetter <david(at)fetter(dot)org>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, Julian Markwort <julian(dot)markwort(at)uni-muenster(dot)de>, Stephen Frost <sfrost(at)snowman(dot)net>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Valery Popov <v(dot)popov(at)postgrespro(dot)ru>
Subject: Re: Password identifiers, protocol aging and SCRAM protocol
Date: 2017-02-03 23:01:10
Message-ID: CAB7nPqQn7puxLWNRC5fAn_awqXz7-+8A=f41h0XdE-NE_iB+Og@mail.gmail.com
Lists: pgsql-hackers

On Fri, Feb 3, 2017 at 9:52 PM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> On 12/20/2016 03:47 AM, Michael Paquier wrote:
>>
>> The first thing is to be able to check in the SCRAM code, which is
>> in src/common/, whether a string is valid UTF-8 or not. pg_wchar.c
>> offers a set of routines exactly for this purpose, but that file is
>> built with libpq and is not available to src/common/. So instead of
>> moving the whole file, I'd like to create a new file,
>> src/common/utf8.c, which includes pg_utf_mblen() and pg_utf8_islegal().
>
> Sounds reasonable. They're short functions, might also be ok to just
> copy-paste them to scram-common.c.

Having a separate file makes the most sense to me, I think; if we can
avoid code duplication, so much the better.
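
For reference, the check in the SCRAM code could then look roughly
like the following (just a sketch, assuming the two routines keep
their current signatures and that a common/utf8.h header gets created
for them):

#include "common/utf8.h"        /* hypothetical new header */

/*
 * Return true if the given string is made only of legal UTF-8
 * sequences.  Sketch only.
 */
static bool
string_is_valid_utf8(const char *str, int len)
{
    int         i = 0;

    while (i < len)
    {
        int         mblen = pg_utf_mblen((const unsigned char *) str + i);

        /* reject sequences running past the end or that are not legal */
        if (i + mblen > len ||
            !pg_utf8_islegal((const unsigned char *) str + i, mblen))
            return false;
        i += mblen;
    }
    return true;
}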

>> The second thing is the normalization itself. Per RFC 4013, NFKC
>> needs to be applied to the string. The operation is described
>> completely in [1]: it consists of 1) a compatibility decomposition
>> of the string, followed by 2) a canonical composition.
>>
>> About 1). The compatibility decomposition is defined in [2] "by
>> recursively applying the canonical and compatibility mappings, then
>> applying the canonical reordering algorithm". The canonical and
>> compatibility mappings are data available in UnicodeData.txt, the
>> 6th field of the records defined in [3] to be precise. The meaning
>> of the decomposition mappings is defined in [2] as well. The
>> canonical decomposition basically consists of looking up a given
>> character and replacing it with the sequence of characters listed
>> in its mapping. The compatibility mappings should be applied as
>> well, but [5], a perl tool called charlint.pl that does this
>> normalization work, does not care about
>
> Not sure. We need to do whatever the "right thing" is, according to
> the RFC. I would assume that the spec is not ambiguous about this,
> but I haven't looked into the details. If it's ambiguous, then I
> think we need to look at some popular implementations to see what
> they do.

The spec defines quite clearly what should be done. The
implementations are sometimes quite loose on some points, though (see
charlint.pl).
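
To be clear about the sequence of steps, NFKC boils down to:
recursively apply the decomposition mappings (canonical and
compatibility), reorder combining marks by their canonical combining
class, then recompose canonically. A classic example is U+1E9B (long
s with dot above): its compatibility decomposition gives U+0073
U+0307, which then composes canonically to U+1E61. Structurally,
something like the following is what I have in mind (all names are
made up for illustration, working on arrays of pg_wchar code points):

/* illustrative prototypes, nothing final */
static int  decompose_compat(const pg_wchar *input, int inlen,
                             pg_wchar *output);
static void reorder_by_comb_class(pg_wchar *str, int len);
static int  compose_canonical(pg_wchar *str, int len);

static int
pg_saslprep_nfkc(const pg_wchar *input, int inlen, pg_wchar *output)
{
    int         len;

    /* 1) compatibility decomposition, applied recursively */
    len = decompose_compat(input, inlen, output);
    /* 2) canonical reordering of sequences of combining marks */
    reorder_by_comb_class(output, len);
    /* 3) canonical composition of the reordered sequence */
    len = compose_canonical(output, len);

    return len;
}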

>> So what we need on the Postgres side is a mapping table with the
>> following fields:
>> 1) The code point of the UTF-8 character, in hexadecimal.
>> 2) Its canonical combining class.
>> 3) The kind of decomposition mapping, if one is defined.
>> 4) The decomposition mapping, in hexadecimal format.
>> Based on what I looked at, either perl or python could be used to
>> process UnicodeData.txt and generate a header file that would be
>> included in the tree. There are 30k entries in UnicodeData.txt, 5k
>> of which have a mapping, so the resulting table will be fairly
>> large. One thing to improve performance would be to store the
>> length of the table in a static variable, order the entries by code
>> point, and do a binary search to find an entry. We could also use
>> fancier things like a set of tables forming a radix tree keyed by
>> the decomposed bytes. We should end up doing just one table lookup
>> per character anyway.
>
> Ok. I'm not too worried about the performance of this. It's only used for
> passwords, which are not that long, and it's only done when connecting. I'm
> more worried about the disk/memory usage. How small can we pack the tables?
> 10kB? 100kB? Even a few MB would probably not be too bad in practice, but
> I'd hate to bloat up libpq just for this.

Indeed. I think I'll first develop a small utility able to do the
operation. There is likely some knowledge in mb/Unicode that we can
reuse here. The radix tree patch would perhaps help?

>> 3) The shape of the mapping table, which depends on how many
>> operations we want to support in the normalization of the strings.
>> The decisions on those items will drive the implementation in one
>> direction or another.
>
> Let's aim for a small disk/memory footprint.

OK, I'll try to give it a shot in a couple of days, in the shape of an
extension or something like that. Thanks for the feedback.
--
Michael
