Re: make_greater_string() does not return a string in some cases

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
To: pgsql-bugs(at)postgresql(dot)org
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: make_greater_string() does not return a string in some cases
Date: 2011-07-08 09:21:16
Message-ID: 20110708.182116.44187733.horiguchi.kyotaro@oss.ntt.co.jp
Lists: pgsql-bugs pgsql-hackers

Hello, may I continue with this topic?

It is hard for those of us who use CJK (Chinese, Japanese, and
Korean) characters in our databases to ignore this glitch.

For Japanese under standard usage, about a hundred characters out
of seven thousand make make_greater_string() fail.

This does not happen very often, but it is not rare enough to
ignore.

I think the glitch arises because the way to derive the
`next character' is fundamentally private to each encoding, yet
make_greater_string() currently does it for all encodings at once,
using a method extended from the one for single-byte ASCII.

So I think it is reasonable for the encoding info table (struct
pg_wchar_tbl) to hold a function that does this.

How about this idea?

The points needed to realize this are as follows:

- pg_wchar_tbl@pg_wchar.c has a new element `charinc' that holds a
  function to increment a character of this encoding.

- By default, the value of charinc is a `generic' increment
  function that does what make_greater_string() does in the current
  implementation.

- make_greater_string() now uses the charinc of the database
  encoding to increment characters, instead of the code written
  directly in it.

- Give UTF-8 a special increment function.
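
To make the last point concrete, here is a minimal sketch of what an
encoding-aware increment for UTF-8 could look like. It is not the code
in the attached patch: the name utf8_charinc and its signature are
assumptions made for illustration, and it assumes the input is already
one valid UTF-8 character.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical charinc for UTF-8 (name and signature are assumptions).
 * Replaces the valid UTF-8 character s[0..len) with the next valid
 * character of the same byte length. Returns false and leaves s
 * untouched if no such character exists.
 */
static bool
utf8_charinc(unsigned char *s, int len)
{
    /* Largest lead byte for sequence lengths 1..4. */
    static const unsigned char max_lead[] = {0x7F, 0xDF, 0xEF, 0xF4};
    unsigned char buf[4];
    int         i;

    if (len < 1 || len > 4)
        return false;
    memcpy(buf, s, (size_t) len);

    /* Add one to the last continuation byte, carrying towards the lead. */
    for (i = len - 1; i > 0; i--)
    {
        if (buf[i] < 0xBF)
        {
            buf[i]++;
            break;
        }
        buf[i] = 0x80;
    }

    /* The carry ran through every continuation byte (or there were none). */
    if (i == 0)
    {
        if (buf[0] >= max_lead[len - 1])
            return false;       /* would need a longer sequence */
        buf[0]++;
    }

    /* Skip the surrogate range U+D800..U+DFFF, which is not valid UTF-8. */
    if (len == 3 && buf[0] == 0xED && buf[1] >= 0xA0)
    {
        buf[0] = 0xEE;
        buf[1] = 0x80;
        buf[2] = 0x80;
    }

    /* Nothing above U+10FFFF is valid. */
    if (len == 4 && buf[0] == 0xF4 && buf[1] > 0x8F)
        return false;

    memcpy(s, buf, (size_t) len);
    return true;
}
```

The function never changes the byte length of the character, so the
last character of each length simply reports failure and leaves the
decision to the caller.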

As a consequence of this modification, make_greater_string()
becomes somewhat simpler, because the sequence that handles bare
bytes in the string disappears. And incrementing a character with
knowledge of the encoding can be straightforward, light, and
backtrack-free, with fewer glitches than the generic method.
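
The simplified control flow described above might be sketched as
follows. This is an illustration only, not the real
make_greater_string(): enc_ops, make_greater, and the ASCII helpers
are hypothetical names standing in for an entry of the encoding table
and its charinc function.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-encoding entry; not the real pg_wchar_tbl. */
typedef struct
{
    int         (*mblen) (const unsigned char *s);      /* byte length of char at s */
    bool        (*charinc) (unsigned char *s, int len); /* increment one char in place */
} enc_ops;

/* Trivial single-byte encoding used only to exercise the loop. */
static int
ascii_mblen(const unsigned char *s)
{
    (void) s;
    return 1;
}

static bool
ascii_charinc(unsigned char *s, int len)
{
    if (len != 1 || s[0] >= 0x7F)
        return false;
    s[0]++;
    return true;
}

static const enc_ops ascii_ops = {ascii_mblen, ascii_charinc};

/*
 * Try to increment the last character of buf[0..len). If that character
 * cannot be incremented, drop it and retry with the one before it.
 * Returns the new length, or 0 if no greater string could be made.
 */
static size_t
make_greater(const enc_ops *enc, unsigned char *buf, size_t len)
{
    while (len > 0)
    {
        size_t      last = 0;
        size_t      pos = 0;

        /* Walk forward to find where the last character starts. */
        while (pos < len)
        {
            int         l = enc->mblen(buf + pos);

            if (l < 1)
                return 0;       /* defensive: avoid looping forever */
            last = pos;
            pos += (size_t) l;
        }
        if (pos != len)
            return 0;           /* string ends in a truncated character */

        if (enc->charinc(buf + last, (int) (len - last)))
            return len;

        len = last;             /* truncate and try the previous character */
    }
    return 0;
}
```

The loop itself never inspects individual bytes; everything that
depends on the encoding is behind the two function pointers.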

# But the processing for BYTEAOID disappointingly remains there.

Some glitches still remain, but I think it would be overkill to do
a conversion that changes the length of the character. Only 5
code points remain out of 17 thousand (with the current method,
roughly across all BMP characters), and none of them is a Japanese
character :-)

The attached patch is a sample implementation of this idea.

What do you think about this patch?

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
unknown_filename text/plain 16.2 KB
