Re: Proposal - Support for National Characters functionality

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Arulappan, Arul Shaji" <arul(at)fast(dot)au(dot)fujitsu(dot)com>
Cc: Tatsuo Ishii <ishii(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal - Support for National Characters functionality
Date: 2013-07-30 12:35:08
Message-ID: 6046.1375187708@sss.pgh.pa.us
Lists: pgsql-hackers

"Arulappan, Arul Shaji" <arul(at)fast(dot)au(dot)fujitsu(dot)com> writes:
> Given below is a design draft for this functionality:

> Core new functionality (new code):
> 1)Create and register independent NCHAR/NVARCHAR/NTEXT data types.

> 2)Provide support for the new GUC nchar_collation to provide the
> database with information about the default collation that needs to be
> used for the new data types.

A GUC seems like completely the wrong tack to be taking. In the first
place, that would mandate just one value (at a time anyway) of
collation, which is surely not much of an advance over what's already
possible. In the second place, what happens if you change the value?
All your indexes on nchar columns are corrupt, that's what. Actually
the data itself would be corrupt, if you intend that this setting
determines the encoding and not just the collation. If you really are
speaking only of collation, it's not clear to me exactly what this
proposal offers that can't be achieved today (with greater security,
functionality and spec compliance) by using COLLATE clauses on plain
text columns.
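
To make the index hazard concrete, here is a minimal sketch in Python, not
PostgreSQL code. The names build_index, index_lookup, codepoint_order and
caseless_order are hypothetical stand-ins: a sorted list plays the role of a
btree index, and the two key functions play the role of two collations. It
assumes only that a lookup trusts the entries to be ordered by whatever
collation is currently in effect.

```python
# Illustrative sketch only: a sorted list stands in for a btree index and
# key functions stand in for collations. None of this is PostgreSQL code.

def codepoint_order(s):
    """Hypothetical 'collation' A: plain code point order."""
    return s

def caseless_order(s):
    """Hypothetical 'collation' B: case-insensitive order."""
    return s.lower()

def build_index(values, collation_key):
    """Build the 'index' by sorting under the collation in effect now."""
    return sorted(values, key=collation_key)

def index_lookup(index, value, collation_key):
    """Binary search that trusts the index to be ordered by collation_key."""
    target = collation_key(value)
    lo, hi = 0, len(index)
    while lo < hi:
        mid = (lo + hi) // 2
        if collation_key(index[mid]) < target:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(index) and index[lo] == value

# The index is built while collation A is the global setting.
index = build_index(["apple", "Banana", "cherry", "Date"], codepoint_order)
print(index)  # ['Banana', 'Date', 'apple', 'cherry']

# Lookup under the same collation finds the entry.
print(index_lookup(index, "apple", codepoint_order))  # True

# The global setting changes to collation B; the index is not rebuilt.
# The entry is still stored, but the search can no longer find it.
print(index_lookup(index, "apple", caseless_order))   # False
```

Nothing was deleted in the last step; only the ordering assumption changed.
A collation attached to the column or index definition does not have this
failure mode, because it cannot drift away from the order the index was
built with.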

Actually, you really haven't answered at all what it is you want to do
that COLLATE can't do.

> 4)Because all symbols from non-UTF8 encodings could be represented as
> UTF8 (but the reverse is not true) comparison between N* types and the
> regular string types inside database will be performed in UTF8 form.

I believe that in some Far Eastern character sets there are some
characters that map to the same Unicode code point, but that some people
would prefer to keep separate. So transcoding to UTF8 isn't necessarily
lossless. This is one of the reasons why we've resisted adopting ICU or
standardizing on UTF8 as the One True Database Encoding. Now this may
or may not matter for comparison to strings that were in some other
encoding to start with --- but as soon as you base your design on the
premise that UTF8 is a universal encoding, you are sliding down a
slippery slope to a design that will meet resistance.

> 6)Client input/output of NATIONAL strings - NATIONAL strings will
> respect the client_encoding setting, and their values will be
> transparently converted to the requested client_encoding before
> sending(receiving) to client (the same mechanics as used for usual
> string types).
> So no mixed encoding in client input/output will be supported/available.

If you have this restriction, then I'm really failing to see what
benefit there is over what can be done today with COLLATE.

regards, tom lane
