Re: Implementing full UTF-8 support (aka supporting 0x00)

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Álvaro Hernández Tortosa <aht(at)8kdata(dot)com>
Cc:	Kevin Grittner <kgrittn(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Implementing full UTF-8 support (aka supporting 0x00)
Date:	2016-08-03 19:41:54
Message-ID:	16003.1470253314@sss.pgh.pa.us
Lists:	pgsql-hackers

Álvaro Hernández Tortosa <aht(at)8kdata(dot)com> writes:
> According to https://en.wikipedia.org/wiki/UTF-8#Codepage_layout
> the encoding used in Modified UTF-8 is an (otherwise) invalid UTF-8 code
> point. In short, the \u00 nul is represented (overlong encoding) by the
> two-byte, 1 character sequence \uc080. These two bytes are invalid UTF-8
> so should not appear in an otherwise valid UTF-8 string. Yet they are
> accepted by Postgres (like if Postgres would support Modified UTF-8
> intentionally).

Really? It sure looks to me like pg_utf8_islegal() would reject this.
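
For illustration only, here is a minimal sketch of what a strict validator does with the byte pair in question. It uses Python's built-in strict UTF-8 codec as a stand-in; it is not pg_utf8_islegal() and says nothing about what a given PostgreSQL build actually accepts. The helper name is_strict_utf8 is hypothetical.

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 codec.

    Stand-in for a strict validator; not PostgreSQL's implementation.
    """
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# Modified UTF-8 encodes U+0000 as the overlong two-byte sequence C0 80.
MODIFIED_UTF8_NUL = b"\xc0\x80"

print(is_strict_utf8(MODIFIED_UTF8_NUL))  # False: overlong form is rejected
print(is_strict_utf8(b"\x00"))            # True: a plain NUL byte is valid UTF-8
print(is_strict_utf8(b"\xc3\x81"))        # True: ordinary two-byte sequence (U+00C1)
```

A strict decoder treats 0xC0 as an invalid lead byte, because any two-byte sequence starting with 0xC0 or 0xC1 would be an overlong encoding of a code point that fits in one byte.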

We could hack it to allow the case, no doubt, but I concur with Peter's
concern that we'd have trouble with OS-level code that is strict about
what UTF8 allows. glibc, for example, is known to do very strange things
with strings that it thinks are invalid in the active encoding.

regards, tom lane

In response to

Re: Implementing full UTF-8 support (aka supporting 0x00) at 2016-08-03 19:13:18 from Álvaro Hernández Tortosa
