SQL_ASCII vs. 7-bit ASCII encodings

From: Oliver Jowett <oliver(at)opencloud(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: SQL_ASCII vs. 7-bit ASCII encodings
Date: 2005-05-12 02:42:36
Message-ID: 4282C29C.4020000@opencloud.com
Lists: pgsql-hackers

The SQL_ASCII-breaks-JDBC issue just came up yet again on the JDBC list,
and I'm wondering if we can do something better on the server side to
help solve it.

The problem is that people have SQL_ASCII databases with non-7-bit data
in them under some encoding known only to a (non-JDBC) application.
Changing client_encoding has no effect on a SQL_ASCII database; it's
always passthrough. So when a JDBC client is later written, and the JDBC
driver sets client_encoding=UNICODE, we get data corruption and/or
complaints from the driver that the server is sending it invalid Unicode
(because it's really LATIN1 or whatever the original inserter happened
to use).
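
To illustrate the failure mode, here is a minimal sketch in Python rather than against a real server. It assumes the original application wrote LATIN1 bytes and that the database handed them back untouched; the sample string and the helper name decode_as_utf8 are made up for the illustration and are not part of any driver.

```python
# Bytes as a LATIN1 application would have sent them. A SQL_ASCII
# database stores and returns them unchanged (passthrough).
stored = "café".encode("latin-1")  # b'caf\xe9'


def decode_as_utf8(raw: bytes):
    """Mimic a client that assumes the server is sending UTF-8.

    Returns the decoded string, or None if the bytes are not valid UTF-8.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


# The byte 0xE9 starts a multi-byte UTF-8 sequence that never completes,
# so the UTF-8 client cannot decode the value.
utf8_view = decode_as_utf8(stored)

# Only someone who knows the original encoding can recover the text.
latin1_view = stored.decode("latin-1")
```

The bytes themselves are intact; what is missing is any record that they are LATIN1, which is exactly the information a passthrough database never kept.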

At this point the user has real problems as there is existing data in
their database in one or more encodings, but the encoding info
associated with that data has been lost. Converting such a database to a
single database-wide encoding is painful at best.

I suppose that we can't change the semantics of SQL_ASCII without
backwards compatibility problems. I wonder if introducing a new encoding
that only allows 7-bit ASCII, and making that the default, is the way to
go.

This new encoding would be treated like any other normal encoding, i.e.
with client_encoding set, the server transcodes (I expect that'd be a 1:1
mapping in most or all cases) and rejects unmappable characters as soon
as they're encountered.
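
A rough sketch of the check such an encoding would imply, written in Python purely for illustration. The function name check_7bit and the error text are hypothetical; this is not how the server's encoding code is structured.

```python
def check_7bit(raw: bytes) -> bytes:
    """Accept input only if every byte is 7-bit ASCII (0x00-0x7F).

    Raises ValueError at the first offending byte, so the problem shows
    up when the data is submitted rather than when it is read back.
    """
    for offset, byte in enumerate(raw):
        if byte > 0x7F:
            raise ValueError(
                f"byte 0x{byte:02x} at offset {offset} is not 7-bit ASCII"
            )
    return raw


# Pure ASCII passes through, and transcoding it to UTF-8 is a 1:1 mapping.
ok = check_7bit(b"plain text")
same_in_utf8 = ok.decode("ascii").encode("utf-8") == ok

# A LATIN1-encoded value is rejected immediately.
try:
    check_7bit("café".encode("latin-1"))
    rejected = False
except ValueError:
    rejected = True
```

Because every 7-bit ASCII byte maps to the same byte in UTF-8 and in the LATIN family, conversion to those client encodings would be the identity for accepted data.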

Then the problem is visible as soon as problematic strings are given to
the server, rather than when a client that depends on having proper
encoding information (such as JDBC) happens to be used. If the DB is
only using simple 7-bit ASCII, then there's no change in behaviour. If
the DB does need to store additional characters, the user is forced to
choose an appropriate encoding before any encoding info is lost.

Any thoughts on this?

-O
