Re: UTF8 with BOM support in psql

From: Chuck McDevitt <cmcdevitt(at)greenplum(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>, Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: UTF8 with BOM support in psql
Date: 2009-11-17 08:59:25
Message-ID: 2106D8DC89010842BABA5CD03FEA4061012E8BE3B9@EXVMBX018-10.exch018.msoutlookonline.net
Lists: pgsql-hackers

>
> I don't know what the best solution is here. The BOM encoded as UTF-8
> is valid data in other encodings. Of course, there is your point that
> such data cannot be at the start of an SQL command.
>

Is the UTF-8 BOM (EF BB BF) actually valid data in any other multi-byte encoding (other than its intended use in UTF-8)?

I realize that for single-byte encodings, such as latin-1, it would be legal as data, since any bytes other than 00 are legal, although it is never legal outside a quoted string in a SQL command or psql command.
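
One way to probe this empirically is to try decoding the three bytes under a few encodings. The following is only a sketch in Python, not anything psql does: the helper name decodes_cleanly and the list of encodings are arbitrary choices for illustration, and Python's codec tables may not match PostgreSQL's own conversion tables.

```python
BOM_BYTES = b"\xef\xbb\xbf"  # the UTF-8 encoding of U+FEFF


def decodes_cleanly(data, encoding):
    """Return True if `data` is a complete, valid byte sequence in `encoding`.

    Hypothetical helper for illustration only.
    """
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    # Arbitrary sample of encodings; results depend on Python's codec tables.
    for enc in ("latin-1", "euc_jp", "shift_jis", "euc_kr", "gbk", "big5"):
        print(enc, decodes_cleanly(BOM_BYTES, enc))
```

In latin-1 the three bytes decode to the characters "ï»¿", which is why they are legal there as data. Note that the check only covers the three bytes on their own; in a multi-byte encoding the last byte could also be the first half of a longer character.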

Certainly, no psql command input file can start with those bytes, or you would get an error (unless psql is changed so the BOM is ignored).

As to zero-width non-breaking space: the BOM is supposed to be treated as such if it is in the middle of a string, but if it is at the start, it is just the BOM and isn't considered part of the data, if I'm reading the spec right. Perhaps the lexers should allow for it as white space (along with other Unicode space characters, such as U+2060).
It's not really important, since allowing the BOM sequence in the middle of a file is "deprecated" according to the Unicode standard.

And what if you see a real BOM, FF FE or FE FF or FF FE 00 00 or 00 00 FE FF? Give an error saying UTF-16 and UTF-32 are not supported?

Or is there a plan to read and convert the UTF-16 or UTF-32 to UTF-8, so psql and PostgreSQL understand it?
(BTW, that would actually be nice on Windows, where UTF-16 is common).
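
To illustrate what such a conversion would amount to, here is a small Python sketch (again, not psql code; the function name utf16_to_utf8 is made up for this example). It relies on Python's "utf-16" codec, which reads the BOM to pick the byte order and drops it from the result.

```python
def utf16_to_utf8(data):
    """Transcode UTF-16 input that starts with a BOM into UTF-8 without a BOM.

    Hypothetical helper for illustration only.
    """
    # The "utf-16" codec uses the leading BOM to choose little- or
    # big-endian decoding and does not include it in the decoded text.
    text = data.decode("utf-16")
    return text.encode("utf-8")


if __name__ == "__main__":
    # FF FE is the little-endian UTF-16 BOM.
    sample = b"\xff\xfe" + "SELECT 1;".encode("utf-16-le")
    print(utf16_to_utf8(sample))
```

A real implementation would also have to decide what to do with input that fails to decode; this sketch simply lets the decoding error propagate.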

If we accept UTF-8 BOM, we should at least detect the other BOM sequences and give an error or warning.
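
The detection logic suggested above could look roughly like the following Python sketch. This is not a proposed patch: the names detect_bom and strip_or_reject_bom and the error text are assumptions made for illustration. The BOM constants come from Python's standard codecs module.

```python
import codecs

# Longest signatures first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so it has to be tested before it.
_BOMS = (
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
)


def detect_bom(data):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def strip_or_reject_bom(data):
    """Drop a leading UTF-8 BOM; raise ValueError for UTF-16/UTF-32 BOMs."""
    name, length = detect_bom(data)
    if name is None:
        return data            # no BOM, pass the input through unchanged
    if name == "UTF-8":
        return data[length:]   # skip the three BOM bytes
    raise ValueError(
        f"input starts with a {name} byte order mark; {name} is not supported"
    )
```

For example, strip_or_reject_bom(b"\xef\xbb\xbfSELECT 1;") returns b"SELECT 1;", while input beginning with FE FF raises ValueError. One inherent ambiguity remains: a little-endian UTF-16 file whose first character after the BOM is U+0000 is indistinguishable from UTF-32 by its first four bytes.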

Overall, from my point of view as a user, having psql deal with the BOM (at least the UTF-8 one) would be friendlier than the current behavior, as some editors (Notepad, for example) automatically add the BOM to the beginning of Unicode files, and its presence is not obvious without dumping the file in hex.
