Re: SCRAM authentication, take three

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: a(dot)alekseev(at)postgrespro(dot)ru, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SCRAM authentication, take three
Date: 2017-03-03 05:43:25
Message-ID: CAB7nPqT6+4NU8aV+tpq=3OMkhgJkfLqdn5A=K0cmayjqu4qOaQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Mar 2, 2017 at 9:13 PM, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> I'm studying the normalization of Unicode so I apologize possible
> stupidity in advance.
>
> I looked into this and have some comments. Sorry for the random
> order.

Thanks, this needs a lot of background and I am glad that somebody is
taking the time to look at what I am doing here.

> ====
> Composition version should be written some where.

Sure.

> ====
> Perhaps one of the most important things is that the code exactly
> reflects the TR. pg_utf8_check_string returns true for ASCII
> strings but the TR says that
>
> | Text containing only ASCII characters (U+0000 to U+007F) is left
> | unaffected by all of the normalization forms. This is
> | particularly important for programming languages
>
> And running SASLprepare for apparent ASCII string (which is the
> most case) is a waste of CPU cycles.

Yeah, that's true. We could just for example check in
pg_utf8_check_string() if the length gathered matches strlen(source)
as only ASCII are 1-byte long.

> ====
> From the point of view of reflectivity (please someone teach me an
> appropreate wording for this..), basically the code had better to
> be a copy of the reference code as long as no performance
> degradation occurs. Hangul part of get_decomposed_size(and
> decompose_code, recompose_code) uses different naming from the
> TR. hindex should be sindex and t should be tindex. Magic numbers
> should have names in the TR.
>
> * A bit later, I noticed that these are copies of charlint. If so
> I want a description about that.)

Yeah, their stuff works quite nicely.

> ====
> The following comment is equivalent to "reordering in canonical
> order". But the definition of "decomposition" includes this
> step. (I'm not sure if it needs rewriting, since I acutually
> could understand that.)
>
>> /*
>> * Now that the decomposition is done, apply the combining class
>> * for each multibyte character.
>> */

I have reworked a bit this one:
/*
- * Now that the decomposition is done, apply the combining class
- * for each multibyte character.
+ * Now end the decomposition by applying the combining class for
+ * each multibyte character.
*/

> ====
>> * make the allocation of the recomposed string based on that assumption.
>> */
>> recomp_chars = (pg_wchar *) malloc(decomp_size * sizeof(int));
>> lastClass = -1; /* this eliminates a special check */
>
> utf_sasl_prepare uses variable names with two naming
> conventions. Is there any reason for the two?

OK, did some adjustments here.

> ====
>> starterCh = recomp_chars[0] = decomp_chars[0];
>
> starterCh reads as "starter channel" why not "starter_char"?

Starter character of the current set, which is a character with a
combining class of 0.

> ====
>> else if (start >= SBASE && start < (SBASE + SCOUNT) &&
>> ((start - SBASE) % TCOUNT) == 0 &&
>> code >= TBASE && code <= (TBASE + TCOUNT))
>
> "code <= (TBASE + TCOUNT)" somewhat smells. Then I found the
> original code for the current implementation in charlint and it
> seems correct to me. Some description about the difference
> between them is needed.

Right. I have updated all those things to use constants instead of
hardcoded values.

> ====
> In use_sasl_prepare, the recompose part is the copy of charlint
> but it seems a bit inefficient. It calls recompose_code
> unconditionally but it is required only for the case of
> "lastClass < chClass".
>
> Something like this. (This still calls recompose_code for the
> case that ch is the same position with starterChar so there still
> be room for more improvement.)

Agreed.

> ====
> If I read the TR correctly, "6 Composition Exclusion Table" says
> that there some characters not to be composed. But I don't find
> the corresponding code. (or comments)

Ah, right! There are 100 characters that enter in this category, and
all of them have a combining class of 0, so it is as simple as
removing them from the tables generated.

I am attaching 0009 and 0010 that address those problems (pushed on
github as well) that can be applied on top of the latest set.
--
Michael

Attachment Content-Type Size
0009-Set-of-fixes-for-SASLprep.patch application/octet-stream 7.1 KB
0010-Consider-characters-excluded-from-composition-in-con.patch application/octet-stream 14.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kuntal Ghosh 2017-03-03 05:53:00 Re: Performance degradation in TPC-H Q18
Previous Message Ashutosh Bapat 2017-03-03 04:26:56 Re: Print correct startup cost for the group aggregate.