Re: Invalid EUC_JP char seq bug?

From: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
To: jc(at)mega-bucks(dot)co(dot)jp
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: Invalid EUC_JP char seq bug?
Date: 2003-07-02 10:00:40
Message-ID: 20030702.190040.74753986.t-ishii@sra.co.jp
Lists: pgsql-bugs

> >>search_words=%B7%F6%BA%7E
> >>select id from products where name like '??~'
> >>Query failed: ERROR: Invalid EUC_JP character sequence found (0xba7e)
> >
> >
> > This is definitely a bad EUC_JP.
>
> According to a PHP developer in my bug report
> (http://bugs.php.net/bug.php?id=24309&edit=2):
>
> "URL decoded byte sequance of 'search_words=%B7%F6%BA%7E' is
> B7E6+BA7E, which is correct EUC-JP character sequence. [snip] But, I
> believe encoding detection of mbstring works fine in this case.
> B7E6+BA7E is not correct byte sequence of SJIS, UTF-8, ISO2022-JP. It is
> correct EUC-JP byte sequence."
>
> I see that he wrote B7E6 instead of the correct B7F6. I resubmitted my
> bug report to PHP and pointed this out. Hopefully the developer will see
> that this sequence is incorrect EUC-JP and that PHP failed to detect this :)

In the EUC_JP encoding there are some rules:

1) if the first byte is 0x8e then the second byte is a JIS 0201
character and should be greater than 0x7f

2) else if the first byte is 0x8f then the second and third bytes are
a JIS 0212 character and both should be greater than 0x7f

3) else if the first byte is greater than 0x7f then the first and
second bytes are a JIS 0208 character and the second byte should also
be greater than 0x7f

4) else the byte is ASCII and should be equal to or less than 0x7f

Apparently:

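B7F6: this is ok. Rule #3 applies and both bytes are greater than 0x7f
BA7E: this is not good. The first byte 0xba selects rule #3, but the
second byte 0x7e is not greater than 0x7f, so none of rules #1 to #4
is satisfied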
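
For illustration, the rules above can be sketched in a few lines of
Python. This is only a rough check under the rules as stated in this
mail, not the code PostgreSQL actually uses: the function name
find_invalid_euc_jp is made up for the example, rule #3 is read as a
two-byte sequence (lead byte plus one trailing byte, as the B7F6 case
implies), and the stricter per-character-set byte ranges are ignored.

```python
from urllib.parse import unquote_to_bytes


def find_invalid_euc_jp(data: bytes):
    """Return (offset, bad_bytes) for the first sequence that breaks
    the four rules, or None if the whole input passes.

    Hypothetical helper for illustration only.
    """
    i = 0
    n = len(data)
    while i < n:
        first = data[i]
        if first == 0x8E:      # rule 1: JIS 0201, one trailing byte
            length = 2
        elif first == 0x8F:    # rule 2: JIS 0212, two trailing bytes
            length = 3
        elif first > 0x7F:     # rule 3: JIS 0208, one trailing byte
            length = 2
        else:                  # rule 4: plain ASCII
            i += 1
            continue
        seq = data[i:i + length]
        # Reject a truncated sequence or any trailing byte <= 0x7f.
        if len(seq) < length or any(b <= 0x7F for b in seq[1:]):
            return i, seq
        i += length
    return None


# The URL-encoded value from the original report.
raw = unquote_to_bytes("%B7%F6%BA%7E")
bad = find_invalid_euc_jp(raw)
if bad is not None:
    offset, seq = bad
    print(f"invalid sequence 0x{seq.hex()} at offset {offset}")
```

Run against the bytes from the report, this accepts B7F6 and stops at
BA7E, which is the same sequence the server complained about.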

> Thanks!
>
> Jean-Christian Imbeault
>
> PS I posted to HACKERS a few weeks ago about another bug (a real one :)
> in the EUC-JP translation having to do with the WAVE DASH. I'll repost
> here on the BUGS list, could you let me know the status of that BUG? Thanks!

Sorry for the delay. In EUC-JP <--> Unicode translation, WAVE DASH is
always a problem since there are several different mappings among
different vendors/standards. I think I need more time to solve this.
--
Tatsuo Ishii
