Re: Reducing the overhead of NUMERIC data

From: Gregory Maxwell <gmaxwell(at)gmail(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: mark(at)mark(dot)mielke(dot)cc, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Reducing the overhead of NUMERIC data
Date: 2005-11-03 19:06:02
Message-ID: e692861c0511031106i3c9b535o677aefd2179a7bb6@mail.gmail.com
Lists: pgsql-hackers pgsql-patches

On 11/3/05, Martijn van Oosterhout <kleptog(at)svana(dot)org> wrote:
> That's called UTF-16 and is currently not supported by PostgreSQL at
> all. That may change, since the locale library ICU requires UTF-16 for
> everything.

UTF-16 doesn't get us out of the variable-length character game; for
that we need UTF-32. The alternative is to support only UCS-2, which is
what some databases do for their Unicode support. I think that would
be a huge step back and, as you point out below, it is not efficient.
:)
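
To illustrate the point about UTF-16, here is a minimal Python sketch
that counts how many storage units a single character needs in each
encoding. The helper name code_units is made up for this example; only
the standard str.encode codecs are used. A character outside the Basic
Multilingual Plane takes two 16-bit units in UTF-16 (a surrogate pair),
so UTF-16 is still variable length.

```python
# Count the storage units one character needs in each encoding.
# code_units is a hypothetical helper written for this illustration.
def code_units(ch: str) -> dict:
    return {
        "utf8_bytes": len(ch.encode("utf-8")),
        "utf16_units": len(ch.encode("utf-16-le")) // 2,  # 16-bit units
        "utf32_units": len(ch.encode("utf-32-le")) // 4,  # 32-bit units
    }

# ASCII letter, Latin-1 letter, euro sign, and a musical symbol
# that lies outside the Basic Multilingual Plane.
for ch in ("a", "\u00e9", "\u20ac", "\U0001D11E"):
    print(f"U+{ord(ch):04X}", code_units(ch))
```

UCS-2 avoids the second unit only by refusing to represent the last
kind of character at all.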

> The question is, if someone declares a field CHAR(20), do they really
> mean to fix 40 bytes of storage for each and every row? I doubt it,
> that's even more wasteful of space than a varlena header.
>
> Which puts you right back to variable length fields.

Another way to look at this is in the context of compression: with
Unicode, characters are really 32-bit values, but only a small range
of those values is common. So we store and work with them in a
compressed format, UTF-8.

The cost of compression is that fixed-width fields cannot be fixed
width, and that some operations are much more expensive than they would
be otherwise.
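
One example of an operation that gets more expensive is locating the
n-th character. The sketch below is only an illustration in Python, not
PostgreSQL code, and both function names are invented for it. With
UTF-8 the offset has to be found by scanning lead bytes; with UTF-32 it
is a single multiplication.

```python
def utf8_offset(buf: bytes, n: int) -> int:
    """Byte offset of the n-th character (0-based) in UTF-8 data.

    Requires a linear scan, because characters occupy 1 to 4 bytes.
    """
    seen = 0
    for i, b in enumerate(buf):
        # Continuation bytes look like 10xxxxxx; anything else starts
        # a new character.
        if (b & 0xC0) != 0x80:
            if seen == n:
                return i
            seen += 1
    raise IndexError(n)


def utf32_offset(n: int) -> int:
    """Byte offset of the n-th character in UTF-32 data: constant time."""
    return 4 * n


text = "a\u00f1\u20ac\U0001D11Ez"   # 1-, 2-, 3-, 4- and 1-byte characters
buf = text.encode("utf-8")
print([utf8_offset(buf, i) for i in range(len(text))])
print([utf32_offset(i) for i in range(len(text))])
```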

As such, it might be more interesting to ask some other questions: are
we using the best compression algorithm for the application, and why
do we sometimes stack two compression algorithms? For longer fields,
would we be better off working with UTF-32 and being more aggressive
about where we LZ compress the fields?
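
A rough way to explore the last question is to compress the same text
in both encodings and compare sizes. The sketch below uses Python's
zlib purely as a stand-in for an LZ-style compressor; it is not the
compressor PostgreSQL uses, the function name is hypothetical, and the
numbers it prints depend entirely on the sample text, so treat it as an
experiment to run rather than a result.

```python
import zlib


def compressed_sizes(text: str) -> dict:
    """Return {encoding: (raw_bytes, compressed_bytes)} for the text."""
    out = {}
    for name in ("utf-8", "utf-32-le"):
        raw = text.encode(name)
        packed = zlib.compress(raw, 6)
        # Sanity check: compression must be lossless.
        assert zlib.decompress(packed) == raw
        out[name] = (len(raw), len(packed))
    return out


# Assumed sample data; real column contents will behave differently.
sample = "Reducing the overhead of NUMERIC data. " * 50
for name, (raw, packed) in compressed_sizes(sample).items():
    print(f"{name:10s} raw={raw:6d} compressed={packed:6d}")
```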

> > I dunno... no opinion on the matter here, but I did want to point out
> > that the field can be fixed length without a header. Those proposing such
> > a change, however, should accept that this may result in an overall
> > expense.
>
> The only time this may be useful is for *very* short fields, in the
> order of 4 characters or less. Else the overhead swamps the varlena
> header...

Not even 4 characters if we are to support all of Unicode. A length
word plus UTF-8 is a win over UTF-32 in most cases for fields with more
than one character.
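
The arithmetic can be checked with a small Python sketch. It assumes a
4-byte length header and space padding to the declared width; both are
assumptions made for this illustration, and the function name is
invented.

```python
HEADER_BYTES = 4  # assumed size of the length word


def storage_sizes(text: str, declared_len: int) -> tuple:
    """Return (header + UTF-8 bytes, fixed-width UTF-32 bytes)."""
    padded = text.ljust(declared_len)          # pad with spaces, CHAR(n) style
    variable = HEADER_BYTES + len(padded.encode("utf-8"))
    fixed = 4 * declared_len                   # 4 bytes per character, no header
    return variable, fixed


for text, n in (("a", 1), ("ab", 2), ("abcd", 4), ("\u00e9t\u00e9", 3)):
    print(repr(text), n, storage_sizes(text, n))
```

For ASCII data the header plus UTF-8 is already smaller at two
characters (6 bytes against 8). Text made mostly of 3-byte characters
needs a few more characters before it breaks even, which is why this
holds in most cases rather than all.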
