Re: Errors in our encoding conversion tables

From: Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at>
To: "'Tom Lane *EXTERN*'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers(at)postgreSQL(dot)org" <pgsql-hackers(at)postgreSQL(dot)org>
Cc: Tatsuo Ishii <ishii(at)postgreSQL(dot)org>
Subject: Re: Errors in our encoding conversion tables
Date: 2015-11-27 08:49:37
Message-ID: A737B7A37273E048B164557ADEF4A58B50FECB63@ntex2010i.host.magwien.gv.at
Lists: pgsql-hackers

Tom Lane wrote:
> There's a discussion over at
> http://www.postgresql.org/message-id/flat/2sa(dot)Dhu5(dot)1hk1yrpTNFy(dot)1MLOlb(at)seznam(dot)cz
> of an apparent error in our WIN1250 -> LATIN2 conversion. I looked into this
> and found that indeed, the code will happily translate certain characters
> for which there seems to be no justification. I made up a quick script
> that would recompute the conversion tables in latin2_and_win1250.c from
> the Unicode mapping files in src/backend/utils/mb/Unicode, and what it
> computes is shown in the attached diff. (Zeroes in the tables indicate
> codes with no translation, for which an error should be thrown.)
>
> Having done that, I thought it would be a good idea to see if we had any
> other conversion tables that weren't directly based on the Unicode data.
> The only ones I could find were in cyrillic_and_mic.c, and those seem to
> be absolutely filled with errors, to the point where I wonder if they were
> made from the claimed encodings or some other ones. The attached patch
> recomputes those from the Unicode data, too.
>
> None of this data seems to have been touched since Tatsuo-san's original
> commit 969e0246, so it looks like we simply didn't vet that submission
> closely enough.
>
> I have not attempted to reverify the files in utils/mb/Unicode against the
> original Unicode Consortium data, but maybe we ought to do that before
> taking any further steps here.
>
> Anyway, what are we going to do about this? I'm concerned that simply
> shoving in corrections may cause problems for users. Almost certainly,
> we should not back-patch this kind of change.

Thanks for picking this up.

I agree with your proposed fix. The only thing that makes me feel uncomfortable
is that you get error messages like:
ERROR: character with byte sequence 0x96 in encoding "WIN1250" has no equivalent in encoding "MULE_INTERNAL"
which is a bit misleading.
But the main thing is that no corrupt data can be entered.
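
To illustrate why a byte like 0x96 ends up with no translation, here is a
minimal sketch of recomputing such a table by going through Unicode. It is
not the script from the quoted mail: it assumes Python's built-in "cp1250"
and "iso8859_2" codecs as stand-ins for the mapping files in
src/backend/utils/mb/Unicode, and build_table is a hypothetical helper name.
As in the quoted description, zero means "no translation".

```python
def build_table(src: str, dst: str) -> list[int]:
    """Map each high byte (0x80-0xFF) of `src` to its byte in `dst`.

    A zero entry means the byte has no equivalent and should raise an error.
    Assumption: Python's codecs stand in for the Unicode mapping files.
    """
    table = []
    for b in range(0x80, 0x100):
        try:
            ch = bytes([b]).decode(src)   # source byte -> Unicode character
            out = ch.encode(dst)          # Unicode character -> target byte
        except UnicodeError:
            # Either the byte is undefined in `src` or the character
            # does not exist in `dst`.
            table.append(0)
            continue
        table.append(out[0])
    return table


win1250_to_latin2 = build_table("cp1250", "iso8859_2")

# Index the table with (byte - 0x80), since it only covers high bytes.
for byte in (0x8A, 0x96, 0xE1):
    print(f"0x{byte:02X} -> 0x{win1250_to_latin2[byte - 0x80]:02X}")
```

In this sketch 0x96 (an en dash in WIN1250) maps to zero because LATIN2 has
no such character, while a letter like 0x8A is moved to a different code.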

I can understand the reluctance to back-patch; nobody likes his
application to suddenly fail after a minor database upgrade.

However, the people whose applications would fail if this were back-patched
are the same people who will certainly run into trouble if they
a) upgrade to a release where this is fixed or
b) try to convert their database to, say, UTF8.

The least we should do is stick a fat warning into the release notes
of the first version where this is fixed, along with some guidelines on what
to do (though I am afraid that there is not much more helpful to say than
"If your database encoding is X and data have been entered with client_encoding Y,
fix your data in the old system").

But I think that this fix should be applied to 9.6.
PostgreSQL has a strong reputation for being strict about correct encoding
(not saying that everybody appreciates that), and I think we shouldn't mar
that reputation.

Yours,
Laurenz Albe
