Re: [v9.2] make_greater_string() does not return a string in some cases

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
To: robertmhaas(at)gmail(dot)com
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-10-21 01:36:46
Message-ID: 20111021.103646.221883029.horiguchi.kyotaro@oss.ntt.co.jp
Lists: pgsql-bugs pgsql-hackers

Hello,

> > Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> >> - Why does the second byte need special handling for 0xED and 0xF4?
> >
> > http://www.faqs.org/rfcs/rfc3629.html
> >
> > See section 4 in particular.  The underlying requirement is to disallow
> > multiple representations of the same Unicode code point.

The special handling skips the UTF-8 byte sequences that correspond
to the regions U+D800 - U+DFFF and U+110000 - U+13FFFF in UCS-4. The
former is reserved for use with the UTF-16 encoding form as
surrogate pairs and does not directly represent characters, as
described in section 3 of RFC 3629. The latter is the region that
lies outside the UTF-8 range by the definition given in the same
section.

former> The definition of UTF-8 prohibits encoding character
former> numbers between U+D800 and U+DFFF, which are reserved for
former> use with the UTF-16 encoding form (as surrogate pairs)
former> and do not directly represent characters.

latter> In UTF-8, characters from the U+0000..U+10FFFF range (the
latter> UTF-16 accessible range) are encoded using sequences of 1
latter> to 4 octets.

# However, I originally wrote this exception simply by mimicking
# the behavior of pg_utf8_verifier().

This must be the basis of the behavior of pg_utf8_verifier(), and
pg_utf8_increment() has inherited it. It may be good to describe
the origin of this special handling in the comments of these
functions to avoid this sort of confusion.
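
A minimal sketch of the rule, not the actual PostgreSQL source: in
section 4 of RFC 3629 the allowed range of the second byte depends on
the value of the first byte. When the first byte is 0xED the second
byte is limited to 0x80-0x9F (0xED 0xA0 0x80 would decode to U+D800),
and when the first byte is 0xF4 it is limited to 0x80-0x8F
(0xF4 0x90 0x80 0x80 would decode to U+110000). The helper name
utf8_second_byte_ok() is hypothetical and exists only for this
illustration.

```c
#include <stdbool.h>

/*
 * Hypothetical helper (not a PostgreSQL function): given the first byte
 * of a 3- or 4-byte UTF-8 sequence, report whether the second byte lies
 * in the range permitted by section 4 of RFC 3629.
 */
static bool
utf8_second_byte_ok(unsigned char first, unsigned char second)
{
    unsigned char lo = 0x80;    /* default range of a continuation byte */
    unsigned char hi = 0xBF;

    switch (first)
    {
        case 0xE0:
            lo = 0xA0;          /* reject overlong 3-byte forms */
            break;
        case 0xED:
            hi = 0x9F;          /* skip surrogates U+D800 - U+DFFF */
            break;
        case 0xF0:
            lo = 0x90;          /* reject overlong 4-byte forms */
            break;
        case 0xF4:
            hi = 0x8F;          /* skip code points above U+10FFFF */
            break;
        default:
            break;
    }
    return second >= lo && second <= hi;
}
```

So the values 0xED and 0xF4 are tested against the first byte, and
only the range allowed for the second byte changes as a result.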

> I'm still confused. The input string is already known to be valid
> UTF-8, so the second byte (if there is one) must be between 0x80 and
> 0xBF. Therefore it will be neither 0xED nor 0xF4.

--
Kyotaro Horiguchi
NTT Open Source Software Center
