Re: psql blows up on BOM character sequence

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Merlin Moncure <mmoncure(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: psql blows up on BOM character sequence
Date: 2014-03-24 23:05:19
Message-ID: 24831.1395702319@sss.pgh.pa.us
Lists: pgsql-hackers

Jim Nasby <jim(at)nasby(dot)net> writes:
> Wait... I thought that was one of the objections... that we wanted to
> leave a BOM in something like a COPY untouched?

I think most of us are okay with stripping a BOM that appears at the
*beginning* of a text file (assuming there's reason to believe the file
is in UTF8 encoding). BOM sequences embedded later in the file are a lot
more debatable, and I for one don't want to assume those can be dropped.
I don't know of any legitimate usage of such cases, and think it's
probably better to report an encoding error.
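
As a rough illustration of the policy described above (this is not psql's
actual code), a minimal Python sketch might strip a BOM only at the very
start of input already believed to be UTF8, and treat any later BOM as an
error. The helper name strip_leading_bom and the use of ValueError as the
"encoding error" are assumptions made for this example.

```python
import codecs

# The UTF-8 encoding of U+FEFF: the three bytes EF BB BF.
BOM = codecs.BOM_UTF8


def strip_leading_bom(data: bytes) -> bytes:
    """Hypothetical helper: drop one BOM at the start, reject any other.

    Assumes the caller already has reason to believe `data` is UTF-8.
    """
    # A BOM at the very beginning is treated as a file encoding marker.
    if data.startswith(BOM):
        data = data[len(BOM):]

    # Anything left over is embedded in the content; report it rather
    # than silently dropping it.
    if BOM in data:
        raise ValueError("unexpected BOM sequence inside UTF-8 input")

    return data
```

With this sketch, b'\xef\xbb\xbfSELECT 1;' comes back as b'SELECT 1;',
input without a BOM is returned unchanged, and a BOM anywhere after the
first byte raises the error.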

> Uh... could we just treat BOM as another whitespace character?

A BOM is *most certainly not* whitespace. The only even semi-legitimate
usage it has in UTF8 is as a file encoding marker. You can bet that the
user whose text editor made the file did not think he had whitespace at
the front. Anyway, your proposition that leading whitespace is ignorable
fails completely for data files.
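
Both points can be checked with a small Python sketch. It relies only on
Python's own Unicode classification, which is an assumption about what
"whitespace" should mean here, and the sample data line is made up for
the example.

```python
BOM_CHAR = "\ufeff"  # U+FEFF, the character a UTF-8 BOM decodes to

# Python's Unicode tables do not classify U+FEFF as whitespace.
bom_is_space = BOM_CHAR.isspace()

# So ordinary whitespace trimming leaves a leading BOM in place.
line_with_bom = BOM_CHAR + "42\tfoo"
trimmed = line_with_bom.strip()
bom_survives_strip = trimmed.startswith(BOM_CHAR)

# In a tab-separated data line (made-up sample), leading spaces are part
# of the first field, so discarding them would change the data.
data_line = "  abc\t1"
first_field = data_line.split("\t")[0]
first_field_if_trimmed = data_line.lstrip().split("\t")[0]
fields_differ = first_field != first_field_if_trimmed
```

Here bom_is_space is False, the BOM survives strip(), and the first field
differs depending on whether leading whitespace was discarded.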

regards, tom lane
