Re: Proposal - Support for National Characters functionality

From: "Boguk, Maksym" <maksymb(at)fast(dot)au(dot)fujitsu(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Arulappan, Arul Shaji" <arul(at)fast(dot)au(dot)fujitsu(dot)com>
Cc: Tatsuo Ishii <ishii(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal - Support for National Characters functionality
Date: 2013-07-31 07:50:21
Message-ID: A756FAD7EDC2E24F8CAB7E2F3B5375E918B12BC0@FALEX03.au.fjanz.com
Lists: pgsql-hackers

Hi everyone,

I will try to answer all the questions raised about the proposed
National Characters support.

>> 2)Provide support for the new GUC nchar_collation to provide the
>> database with information about the default collation that needs to
>> be used for the new data types.

>A GUC seems like completely the wrong tack to be taking. In the first
>place, that would mandate just one value (at a time anyway) of
>collation, which is surely not much of an advance over what's already
>possible. In the second place, what happens if you change the value?
>All your indexes on nchar columns are corrupt, that's what. Actually
>the data itself would be corrupt, if you intend that this setting
>determines the encoding and not just the collation. If you really are
>speaking only of collation, it's not clear to me exactly what this
>proposal offers that can't be achieved today (with greater security,
>functionality and spec compliance) by using COLLATE clauses on plain
>text columns.
>Actually, you really haven't answered at all what it is you want to do
>that COLLATE can't do.

I think I gave a wrong description there... it will not be a GUC but a
GUC-like value that is initialized during CREATE DATABASE and is
read-only afterwards, very similar to lc_collate.
So I think the name national_lc_collate would be better.
The purpose of this value is to provide the default collation for
NATIONAL CHARACTER data inside the database.
That does not limit the user's ability to choose an alternative
collation for NATIONAL CHARACTER columns at CREATE TABLE time via the
COLLATE keyword.

E.g. if we have a second encoding inside the database, we must record
somewhere which collation it uses by default.
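
For illustration, a minimal sketch of how this could look. The
NATIONAL_LC_COLLATE option and NVARCHAR type shown here are proposed,
hypothetical syntax, not something PostgreSQL accepts today:

-- hypothetical proposed syntax; names are illustrative only
CREATE DATABASE legacydb
    ENCODING 'LATIN1'
    NATIONAL_LC_COLLATE 'en_US.utf8';  -- read-only after creation,
                                       -- like lc_collate

-- the database-wide default could still be overridden per column
-- with the standard COLLATE clause:
CREATE TABLE products (
    descr NVARCHAR(200) COLLATE "de_DE.utf8"
);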

>> 4)Because all symbols from non-UTF8 encodings could be represented as
>> UTF8 (but the reverse is not true) comparison between N* types and
>> the regular string types inside database will be performed in UTF8
>> form.

>I believe that in some Far Eastern character sets there are some
>characters that map to the same Unicode glyph, but that some people
>would prefer to keep separate. So transcoding to UTF8 isn't necessarily
>lossless. This is one of the reasons why we've resisted adopting ICU or
>standardizing on UTF8 as the One True Database Encoding. Now this may
>or may not matter for comparison to strings that were in some other
>encoding to start with --- but as soon as you base your design on the
>premise that UTF8 is a universal encoding, you are sliding down a
>slippery slope to a design that will meet resistance.

Would converting both sides to pg_wchar before comparison fix this
problem?
In any case, if the database is going to use more than one encoding,
some universal representation is needed to allow comparisons between
them. After some analysis I think pg_wchar is a better candidate for
this role than UTF8.
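
To make the affected case concrete, a mixed comparison like the
following (with descr a column of the proposed NVARCHAR type, which
does not exist today) is what would require converting both operands
to a common representation before the collation-aware compare:

-- hypothetical schema: descr is a proposed NVARCHAR column,
-- name is plain text in the database encoding
SELECT * FROM products WHERE descr = name;
-- under this proposal both sides would be converted internally
-- (to pg_wchar rather than UTF8) before comparison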

>> 6)Client input/output of NATIONAL strings - NATIONAL strings will
>> respect the client_encoding setting, and their values will be
>> transparently converted to the requested client_encoding before
>> sending(receiving) to client (the same mechanics as used for usual
>> string types).
>> So no mixed encoding in client input/output will be
>> supported/available.

>If you have this restriction, then I'm really failing to see what
>benefit there is over what can be done today with COLLATE.

There are two targets for this project:

1. Legacy databases with a non-UTF8 encoding, which must keep
supporting old non-UTF8 applications while also serving new UTF8
applications.
In that case the old applications keep using the legacy database
encoding (and, because these applications are legacy, they do not
touch the new NATIONAL CHARACTER data/tables).
The new applications use a UTF8 client encoding and can store
international text in NATIONAL CHARACTER columns (see the sketch after
this list).
A dump/restore of the whole database to change its encoding to UTF8 is
not always possible, so an easy-to-use workaround is needed.

2. Better compatibility with the ANSI SQL standard.
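
As a rough sketch of target 1, assuming a LATIN1 database and the
proposed (not yet existing) NVARCHAR type:

-- old application: keeps the legacy client encoding and touches
-- only plain text columns
SET client_encoding = 'LATIN1';
INSERT INTO customers (name) VALUES ('Müller');

-- new application: switches to UTF8 and stores international text
-- in a NATIONAL CHARACTER column (NVARCHAR is proposed syntax)
SET client_encoding = 'UTF8';
INSERT INTO customers (name, name_n) VALUES ('Mueller', '日本語テキスト');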

Kind Regards,
Maksym
