Re: Errors in our encoding conversion tables

From: Tatsuo Ishii <ishii(at)postgresql(dot)org>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Errors in our encoding conversion tables
Date: 2015-11-27 02:00:27
Message-ID: 20151127.110027.1989081859519291674.t-ishii@sraoss.co.jp
Lists: pgsql-hackers

> There's a discussion over at
> http://www.postgresql.org/message-id/flat/2sa(dot)Dhu5(dot)1hk1yrpTNFy(dot)1MLOlb(at)seznam(dot)cz
> of an apparent error in our WIN1250 -> LATIN2 conversion. I looked into this
> and found that indeed, the code will happily translate certain characters
> for which there seems to be no justification. I made up a quick script
> that would recompute the conversion tables in latin2_and_win1250.c from
> the Unicode mapping files in src/backend/utils/mb/Unicode, and what it
> computes is shown in the attached diff. (Zeroes in the tables indicate
> codes with no translation, for which an error should be thrown.)
>
> Having done that, I thought it would be a good idea to see if we had any
> other conversion tables that weren't directly based on the Unicode data.
> The only ones I could find were in cyrillic_and_mic.c, and those seem to
> be absolutely filled with errors, to the point where I wonder if they were
> made from the claimed encodings or some other ones. The attached patch
> recomputes those from the Unicode data, too.
>
> None of this data seems to have been touched since Tatsuo-san's original
> commit 969e0246, so it looks like we simply didn't vet that submission
> closely enough.
>
> I have not attempted to reverify the files in utils/mb/Unicode against the
> original Unicode Consortium data, but maybe we ought to do that before
> taking any further steps here.
>
> Anyway, what are we going to do about this? I'm concerned that simply
> shoving in corrections may cause problems for users. Almost certainly,
> we should not back-patch this kind of change.

I have started looking into it. I wonder how you created this part
of your patch:

*** 154,163 ****
win12502mic(const unsigned char *l, unsigned char *p, int len)
{
static const unsigned char win1250_2_iso88592[] = {
! 0x80, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87,
! 0x88, 0x89, 0xA9, 0x8B, 0xA6, 0xAB, 0xAE, 0xAC,
! 0x90, 0x91, 0x92, 0x93, 0x94, 0x95, 0x96, 0x97,
! 0x98, 0x99, 0xB9, 0x9B, 0xB6, 0xBB, 0xBE, 0xBC,
0xA0, 0xB7, 0xA2, 0xA3, 0xA4, 0xA1, 0x00, 0xA7,
0xA8, 0x00, 0xAA, 0x00, 0x00, 0xAD, 0x00, 0xAF,
0xB0, 0x00, 0xB2, 0xB3, 0xB4, 0x00, 0x00, 0x00,
--- 154,163 ----
win12502mic(const unsigned char *l, unsigned char *p, int len)
{
static const unsigned char win1250_2_iso88592[] = {
! 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
! 0x00, 0x00, 0xA9, 0x00, 0xA6, 0xAB, 0xAE, 0xAC,
! 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
! 0x00, 0x00, 0xB9, 0x00, 0xB6, 0xBB, 0xBE, 0xBC,
0xA0, 0xB7, 0xA2, 0xA3, 0xA4, 0xA1, 0x00, 0xA7,
0xA8, 0x00, 0xAA, 0x00, 0x00, 0xAD, 0x00, 0xAF,
0xB0, 0x00, 0xB2, 0xB3, 0xB4, 0x00, 0x00, 0x00,

In the above you seem to disable the conversion of 0x96 from win1250
to ISO-8859-2, based on the Unicode mapping files in
src/backend/utils/mb/Unicode. But the corresponding mapping file
(iso8859_2_to_utf8.map) does include the following entry:

{0x0096, 0xc296},

How do you know 0x96 should be removed from the conversion?
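
To make the question concrete, here is a minimal sketch of how a
win1250 -> ISO-8859-2 table entry could be derived by composing two
local-to-UTF-8 mappings. It is not the script used for the patch. The
struct, array, and function names are hypothetical, and the win1250
rows are an assumption taken from the published CP1250 table (where
0x96 is U+2013, EN DASH), not from anything in this thread. Only the
ISO-8859-2 row for 0x96 is the entry quoted above.

```c
#include <stddef.h>
#include <stdint.h>

/* One row of a local-code -> UTF-8 mapping, same shape as {0x0096, 0xc296}. */
typedef struct
{
    uint32_t code;  /* byte value in the local encoding */
    uint32_t utf8;  /* UTF-8 byte sequence packed into one integer */
} local_to_utf;

/* Excerpt of ISO-8859-2 -> UTF-8; the 0x96 row is the entry quoted above. */
static const local_to_utf iso8859_2_excerpt[] = {
    {0x0096, 0xc296},       /* U+0096, a C1 control code */
    {0x00a6, 0xc59a},       /* U+015A */
    {0x00a9, 0xc5a0},       /* U+0160 */
};

/* ASSUMPTION: excerpt of WIN1250 -> UTF-8 per the published CP1250 table. */
static const local_to_utf win1250_excerpt[] = {
    {0x008a, 0xc5a0},       /* U+0160 */
    {0x008c, 0xc59a},       /* U+015A */
    {0x0096, 0xe28093},     /* U+2013, EN DASH */
};

#define LENGTHOF(a) (sizeof(a) / sizeof((a)[0]))

/* Return the UTF-8 value for a local code, or 0 if the map has no row. */
static uint32_t
lookup_utf8(const local_to_utf *map, size_t n, uint32_t code)
{
    for (size_t i = 0; i < n; i++)
        if (map[i].code == code)
            return map[i].utf8;
    return 0;
}

/* Return the local code for a UTF-8 value, or 0 if the map has no row. */
static uint32_t
lookup_code(const local_to_utf *map, size_t n, uint32_t utf8)
{
    for (size_t i = 0; i < n; i++)
        if (map[i].utf8 == utf8)
            return map[i].code;
    return 0;
}

/*
 * Compose the two maps: win1250 byte -> UTF-8 -> ISO-8859-2 byte.
 * Returns 0x00 when no translation exists, matching the convention
 * used for zero entries in the tables in the diff.
 */
unsigned char
win1250_to_iso8859_2(unsigned char c)
{
    uint32_t utf8 = lookup_utf8(win1250_excerpt, LENGTHOF(win1250_excerpt), c);

    if (utf8 == 0)
        return 0x00;
    return (unsigned char) lookup_code(iso8859_2_excerpt,
                                       LENGTHOF(iso8859_2_excerpt), utf8);
}
```

Under these assumptions, win1250_to_iso8859_2(0x8A) gives 0xA9 and
win1250_to_iso8859_2(0x8C) gives 0xA6, as in the table above, while
win1250_to_iso8859_2(0x96) gives 0x00: the two encodings both have a
byte 0x96, but the excerpts map it to different Unicode characters. If
the patch was generated along these lines, that might explain the zero
entry.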

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
