> I just rearranged the code in mbutils.c a little bit to make it more
> robust if conversion of an over-length string is attempted, and noted
> this comment:
> * When converting strings between different encodings, we assume that space
> * for converted result is 4-to-1 growth in the worst case. The rate for
> * currently supported encoding pairs are within 3 (SJIS JIS X0201 half width
> * kanna -> UTF8 is the worst case). So "4" should be enough for the moment.
> * Note that this is not the same as the maximum character width in any
> * particular encoding.
> #define MAX_CONVERSION_GROWTH 4
> It strikes me that this is overly pessimistic, since we do not support
> 5- or 6-byte UTF8 characters, and AFAICS there are no 1-byte characters
> in any supported encoding that require 4 bytes in another. Could we
> reduce the multiplier to 3? Or even 2? This has a direct impact on the
> longest COPY lines we can support, so I'd like it not to be larger than
I'm afraid we have to make it larger, rather than smaller, for 8.3. For
example, 0x82f5 in SHIFT_JIS_2004 (new in 8.3) becomes a *pair* of 3-byte
UTF-8 characters (0x00e3818b and 0x00e3829a). See
util/mb/Unicode/shift_jis_2004_to_utf8_combined.map for more details.
So the worst case is now 6, rather than 3.
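
To make the trade-off concrete, here is a minimal sketch of how the multiplier bounds the longest string that can be converted. It is not the mbutils.c code: MAX_ALLOC_SIZE is assumed to mirror the backend's 1 GB palloc limit, and max_source_len() and dest_buf_size() are hypothetical helper names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: mirrors the backend's 1 GB allocation limit (MaxAllocSize). */
#define MAX_ALLOC_SIZE ((size_t) 0x3fffffff)

/*
 * Hypothetical helper: longest source string (in bytes) whose worst-case
 * converted result, plus a terminating NUL, still fits in one allocation.
 */
static size_t
max_source_len(size_t growth)
{
    return (MAX_ALLOC_SIZE - 1) / growth;
}

/*
 * Hypothetical helper: bytes to reserve for converting a source string of
 * "len" bytes when the result may be up to "growth" times longer.
 */
static size_t
dest_buf_size(size_t len, size_t growth)
{
    /* Refuse over-length input instead of overflowing the allocation. */
    assert(len <= max_source_len(growth));
    return len * growth + 1;    /* +1 for the terminating NUL */
}
```

Under these assumptions, max_source_len(4) is 268435455 bytes and max_source_len(6) is 178956970 bytes, so raising the multiplier from 4 to 6 cuts the longest convertible line by about a third. For the 2-byte input 0x82f5, dest_buf_size(2, 6) reserves 13 bytes, while the actual UTF-8 result occupies 6 bytes plus the terminator.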
Can we add a column to pg_conversion which represents the "growth
rate"? This would reduce the rate for most encodings much smaller than
SRA OSS, Inc. Japan