Re: JDBC driver patch for non-ASCII users

From: sulfinu(at)gmail(dot)com
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kris Jurka <books(at)ejurka(dot)com>
Cc: pgsql-jdbc(at)postgresql(dot)org
Subject: Re: JDBC driver patch for non-ASCII users
Date: 2007-12-11 14:46:13
Message-ID: 200712111646.13299.sulfinu@gmail.com
Lists: pgsql-jdbc

On Saturday 08 December 2007, Tom Lane wrote:
> Given the current design that allows different databases in a cluster
> to (claim they) have different encodings, it's real hard to see how
> to handle non-ASCII data in shared catalogs sanely. I don't think
> we'll really be able to fix this properly until that mythical day
> when we have support for per-column encoding selections. My guess
> is that we'd then legislate that shared catalog columns are always
> UTF8; after which we could start to think about what it would take
> to do conversion of the connection startup packet's contents from
> client-side encoding to UTF8.
First of all, judging from the code I have read, you'll have to adjust the wire
protocol so that the encoding is signaled at the very beginning of a
connection! The V3 protocol seems close, but it's not quite there. Take for
example the way encoding information is handled by XML reader programs: the
encoding is declared in the very first line of the document.
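
To make the point concrete, here is a rough sketch (not the actual driver code,
and the helper names are mine) of how a V3 startup packet is assembled: every
parameter string has to be reduced to bytes before the server has told the
client anything, so the charset must either be fixed by convention or declared
up front.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.charset.Charset;

    // Rough sketch of a V3 startup packet: length, protocol version 3.0,
    // then "key\0value\0" pairs and a terminating NUL. The point is that
    // every string becomes bytes *before* any server-side negotiation,
    // so the charset has to be fixed (or signaled) right here.
    public class StartupPacketSketch {

        static byte[] build(String user, String database, Charset cs) throws IOException {
            ByteArrayOutputStream body = new ByteArrayOutputStream();
            writeInt32(body, 3 << 16);           // protocol version 3.0
            writeCString(body, "user", cs);
            writeCString(body, user, cs);        // charset chosen by the client; the server can only guess it
            writeCString(body, "database", cs);
            writeCString(body, database, cs);
            body.write(0);                       // end of the parameter list

            ByteArrayOutputStream packet = new ByteArrayOutputStream();
            writeInt32(packet, body.size() + 4); // length field includes itself
            body.writeTo(packet);
            return packet.toByteArray();
        }

        static void writeCString(ByteArrayOutputStream out, String s, Charset cs) throws IOException {
            out.write(s.getBytes(cs));
            out.write(0);
        }

        static void writeInt32(ByteArrayOutputStream out, int v) {
            out.write((v >>> 24) & 0xFF);
            out.write((v >>> 16) & 0xFF);
            out.write((v >>> 8) & 0xFF);
            out.write(v & 0xFF);
        }
    }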

Next, there's something I already suggested on the "-hackers" mailing list.
Until the day when PostgreSQL is rewritten in a Unicode-savvy language (where
a "char" really is a Unicode code point), I believe you should consider
enforcing a single encoding for the whole database cluster, chosen from the
encodings that cover the entire Unicode repertoire, like UTF-8, UTF-16 etc.
This way, lots of problems disappear, things get cleaner, and clients no longer
need to guess the encoding used on the server side for user names, passwords,
database names, table names and so on. Collation rules would finally depend
solely on the locale, just as they should.

The only downside I see is a (slight) increase in database size, but that's
not an issue nowadays. Perhaps you could offer administrators a choice of
encoding at cluster creation time, one that would statistically minimize the
size depending on the languages used most.
If you have other reasons against it, bring them to the table, but please do
not post ridiculous statements like "I'm not sure a Java char is a Unicode
point" or "I don't think that Unicode covers all languages", which I didn't
even bother to answer with the classic "RTFM!".

Support for per-column encoding selection is, from my point of view, a stupid
waste of development effort and CPU time, not to mention a great opportunity
to introduce a myriad of bugs. You're looking at the problem from the wrong
end: it is not the encodings that must be flexibly chosen, it is the alphabet!
No user is ever going to be interested in the internal encoding of a Postgres
database file, nor should he be. But the user will always appreciate getting
back the same strings he put into the database, regardless of his mother
tongue and client program. The logical solution is to support Unicode and
disregard encodings altogether (or rather, keep them under the hood, since
they are a result of historical limitations).

On Saturday 08 December 2007, Kris Jurka wrote:
> For the record, I'm in favor of changing our use of initial setup encoding
> from SQL-ASCII to UTF-8. While it doesn't solve the root of the problem,
> it does allow people to use non-ascii user and database names if they set
> them up appropriately and doesn't seem to harm anything.
Will you change ALL clients in order to do that? I only needed one client,
JDBC, to actually work - very frustrating, since, being written in Java, it was
supposed to be Unicode-proof. Ironically, psql works because it uses the
platform encoding ;)
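
Here is a tiny example (the non-ASCII role name is hypothetical) of where the
frustration comes from: the same Java string turns into different byte
sequences depending on which charset the client happens to pick, and the server
has no way to tell which one was used when the role was created.

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    // The same user name encoded two ways. On a platform whose default
    // charset is not UTF-8 (say, a Latin-1 locale), the two byte arrays
    // differ - roughly the difference between what a platform-encoding
    // client sends and what a UTF-8-only rule would make every client send.
    public class EncodingGuess {
        public static void main(String[] args) {
            String user = "müller";                                     // hypothetical non-ASCII role name
            byte[] platform = user.getBytes(Charset.defaultCharset());  // platform/environment encoding
            byte[] utf8 = user.getBytes(StandardCharsets.UTF_8);        // a fixed, cluster-wide choice
            System.out.println("platform: " + platform.length + " bytes, UTF-8: " + utf8.length + " bytes");
        }
    }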

> The original
> patch's suggested use of the client's environment encoding seems random to
> me.
It's not random, it is a heuristic approach to guessing the right encoding,
namely the encoding used by the administrator when he created the database
and the user. After all, there cannot be anything random in a computer, can
there?
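
To make the heuristic concrete, here is a rough illustration (not the patch
itself; the method name is mine):

    import java.nio.charset.Charset;

    // Take the client's environment encoding as the best available guess
    // for the encoding the administrator used when the role and database
    // names were created, instead of hard-coding a single charset for the
    // startup packet.
    public class StartupCharsetGuess {
        static Charset guessStartupCharset() {
            String fromEnv = System.getProperty("file.encoding");  // the JVM's platform encoding, if set
            try {
                return fromEnv != null ? Charset.forName(fromEnv) : Charset.defaultCharset();
            } catch (IllegalArgumentException e) {
                return Charset.defaultCharset();                    // fall back if the name is unknown
            }
        }
    }
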
My solution preserves the currently working configurations - ASCII-only
setups will continue to work after the patch is applied. Moreover, UTF-8
setups are guaranteed to always work!
In short, my patch solves today(!), with no undesired side effects, a
limitation of the PostgreSQL authentication procedure in the JDBC driver.
You're free to reject it; I published it for the general benefit (as it
happens, you asked for it yourselves).

Good luck.
