Re: Reducing the overhead of NUMERIC data

From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Gregory Maxwell <gmaxwell(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, mark(at)mark(dot)mielke(dot)cc, Simon Riggs <simon(at)2ndquadrant(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Reducing the overhead of NUMERIC data
Date: 2005-11-04 23:40:33
Message-ID: 20051104234026.GF13966@svana.org
Lists: pgsql-hackers pgsql-patches

On Fri, Nov 04, 2005 at 02:58:05PM -0500, Gregory Maxwell wrote:
> The correct question to ask is something like "Does it support non-bmp
> characters?" or "Does it really support UTF-16 or just UCS2?"
>
> UTF-16 is (now) a variable width encoding which is a strict superset
> of UCS2 which allows the representation of all Unicode characters.
> UCS2 is fixed width and only supports characters from the basic
> multilingual plane. UTF-32 and UCS4 are (now) effectively the same
> thing and can represent all unicode characters with a 4 byte fixed
> length word.

It's all on their website:

: How is a Unicode string represented in ICU?
:
: A Unicode string is currently represented as UTF-16 by default. The
: endianess of UTF-16 is platform dependent. You can guarantee the
: endianess of UTF-16 by using a converter. UTF-16 strings can be
: converted to other Unicode forms by using a converter or with the UTF
: conversion macros.
:
: ICU does not use UCS-2. UCS-2 is a subset of UTF-16. UCS-2 does not
: support surrogates, and UTF-16 does support surrogates. This means
: that UCS-2 only supports UTF-16's Base Multilingual Plane (BMP). The
: notion of UCS-2 is deprecated and dead. Unicode 2.0 in 1996 changed
: its default encoding to UTF-16.
<snip>
: What is the performance difference between UTF-8 and UTF-16?
:
: Most of the time, the memory throughput of the hard drive and RAM is
: the main performance constraint. UTF-8 is 50% smaller than UTF-16 for
: US-ASCII, but UTF-8 is 50% larger than UTF-16 for East and South
: Asian scripts. There is no memory difference for Latin extensions,
: Greek, Cyrillic, Hebrew, and Arabic.
<snip>
http://icu.sourceforge.net/userguide/icufaq.html
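
The size claims in that FAQ entry are easy to check. A minimal sketch, using
Python's built-in codecs rather than ICU (my choice for illustration, not
anything ICU-specific); the sample strings and the helper name encoded_sizes
are made up for this example:

```python
# Compare encoded sizes of the same text in UTF-8 and UTF-16.
# The sample strings are arbitrary illustrations, five characters each.
samples = {
    "US-ASCII": "hello",
    "Greek": "αβγδε",
    "CJK": "日本語の文",
}


def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, with no byte order mark."""
    # "utf-16-le" is used so that no BOM is prepended to the output.
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for name, text in samples.items():
    u8, u16 = encoded_sizes(text)
    print(f"{name}: UTF-8 = {u8} bytes, UTF-16 = {u16} bytes")
```

For these samples the ASCII string is half the size in UTF-8, the Greek
string is the same size in both, and the CJK string is 50% larger in UTF-8.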

: Using UTF-8 strings with ICU
:
: As mentioned in the overview of this chapter, ICU and most other
: Unicode-supporting software uses 16-bit Unicode for internal
: processing. However, there are circumstances where UTF-8 is used
: instead. This is usually the case for software that does little or no
: processing of non-ASCII characters, and/or for APIs that predate
: Unicode, use byte-based strings, and cannot be changed or replaced
: for various reasons.
<snip>
: While ICU does not natively use UTF-8 strings, there are many ways to
: work with UTF-8 strings and ICU. The following list is probably
: incomplete.
http://icu.sourceforge.net/userguide/strings.html#strings

Basically you use a "converter" to process the UTF-8 strings,
presumably converting them to UTF-16 (which is not UCS-2, as noted
above). UTF-32 needs a converter too, so there's no point using that
either.
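
To make the conversion step concrete, here is a rough sketch of what such a
converter does. It uses Python's standard codecs as a stand-in, so it is not
ICU's API, and the function name utf8_to_utf16_units is hypothetical:

```python
import struct


def utf8_to_utf16_units(data):
    """Decode UTF-8 bytes and return the equivalent UTF-16 code units as ints.

    Stand-in for a converter: Python's codecs do the work here, not ICU.
    """
    text = data.decode("utf-8")
    raw = text.encode("utf-16-le")  # little-endian, no BOM
    # Unpack the byte string into unsigned 16-bit code units.
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


# 'a' followed by U+1D11E (MUSICAL SYMBOL G CLEF), which lies outside the BMP.
units = utf8_to_utf16_units("a\U0001D11E".encode("utf-8"))
print([hex(u) for u in units])  # ['0x61', '0xd834', '0xdd1e']
```

The non-BMP character comes out as a surrogate pair (two 16-bit units),
which is exactly what UTF-16 can represent and UCS-2 cannot.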

> The code can demand UTF-16 but still be fine for non-BMP characters.
> However, many things which claim to support UTF-16 really only support
> UCS2 or at least have bugs in their handling of non-bmp characters.
> Software that supports UTF-8 is somewhat more likely to support
> non-bmp characters correctly since the variable length code paths get
> more of a workout in many environments. :)

I think ICU deals with that, but feel free to peruse the website
yourself...
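
For what it's worth, the kind of bug being described looks roughly like
this. A sketch only, with a hypothetical helper count_code_points; it says
nothing about how ICU itself is implemented:

```python
def count_code_points(units):
    """Count Unicode code points in a sequence of UTF-16 code units.

    A high surrogate followed by a low surrogate counts as one code point.
    An unpaired surrogate is counted as one (malformed) unit.
    """
    count = 0
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count


# 'a' followed by U+1D11E encoded as the surrogate pair D834 DD1E.
units = [0x0061, 0xD834, 0xDD1E]
print(len(units))                # 3: what UCS-2-style code would report
print(count_code_points(units))  # 2: the actual number of characters
```

Code that treats every 16-bit unit as a character gets the first answer;
code that really supports UTF-16 gets the second.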

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
