JDBC driver is too fussy about Unicode

From: James Pharaoh <james(at)phsys(dot)co(dot)uk>
To: pgsql-jdbc(at)postgresql(dot)org
Subject: JDBC driver is too fussy about Unicode
Date: 2011-10-28 19:50:43
Message-ID: CALSKDaQw6qbZdjV+DobTNOtuUVxco_rzfEe0=uiRWKLjCSiq9g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-jdbc

Hi,

I have been trying to track down a bug in an application. This is a
message broker for a chat application, written in Hibernate using
PostgreSQL underneath. I was rather confused when I started getting
decoding errors from data written into the database by the same
application.

Here's part of the stack trace from one such exception:

org.hibernate.exception.DataException: could not load an entity:
[com.pharaohsystems.txt2.db.TextRec#24413864]
(...)
Caused by: org.postgresql.util.PSQLException: Invalid character data
was found. This is most likely caused by stored data containing
characters that are invalid for the character set the database was
created in. The most common example of this is storing 8bit data in a
SQL_ASCII database.
at org.postgresql.jdbc2.AbstractJdbc2ResultSet.getString(AbstractJdbc2ResultSet.java:1891)
at org.postgresql.jdbc2.AbstractJdbc2ResultSet.getString(AbstractJdbc2ResultSet.java:2190)

I've gone in and looked at the underlying record. It contains the
following character:

http://www.fileformat.info/info/unicode/char/1f60a/index.htm

Actually the exception above was generated after I went in and
doctored the database so that the field in question contains only this
particular character.

I am confused because this is clearly a valid UTF-8 character. This
can be easily seen by matching the UTF-8 character code shows on the
page above, f09f988a, with the output from a simple query:

txt2=# select text, encode (text::bytea, 'hex') from text where id = 24413864;
text | encode
------+----------
😊 | f09f988a
(1 row)

Thinking about this I have come to the conclusion that it is because
of the Unicode version being used. This character is present in
Unicode 6.0.0, but I am using Java 6, which apparently uses Unicode
4.0.

My point is that this seems like an absurd restriction. Java allows me
to use characters which aren't included in the Unicode specification
it was built with. The postgresql driver allows me to store these
values. But when I try and bring them out it gives me an error.

This is a simple and well defined conversion. This code point exists
in both UTF-8 (the encoding used by the database) and UTF-16 (the
encoding used in Java). The conversion has clearly taken place to
store this in the database. Why can't it pull it out?

I can only think of two solutions (or workarounds). One is to perform
the conversion myself. This is a pain because with a lot of user
specified data these characters could get all over the place. The
other is to filter out these characters. Neither of these are very
appealing.

Does anyone have any thoughts on this? I could of course be wrong and
there could be something else going on here. Any ideas?

Thanks,
James

Responses

Browse pgsql-jdbc by date

  From Date Subject
Next Message Oliver Jowett 2011-10-29 00:17:55 Re: JDBC driver is too fussy about Unicode
Previous Message Stevo Slavić 2011-10-28 12:20:44 Publish 9.1-901 JDBC drivers on maven central