Re: Reducing the overhead of NUMERIC data

From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: mark(at)mark(dot)mielke(dot)cc
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gregory Maxwell <gmaxwell(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Reducing the overhead of NUMERIC data
Date: 2005-11-04 15:13:29
Message-ID: 20051104151327.GB13966@svana.org
Lists: pgsql-hackers pgsql-patches

On Fri, Nov 04, 2005 at 08:38:38AM -0500, mark(at)mark(dot)mielke(dot)cc wrote:
> On Thu, Nov 03, 2005 at 09:17:43PM -0500, Tom Lane wrote:
> > Actually, the real reason we use UTF-8 and not any of the
> > sorta-fixed-size representations of Unicode is that the backend is by
> > and large an ASCII, null-terminated-string engine. *All* of the
> > supported backend encodings are ASCII-superset codes. Making
> > everything null-safe in order to allow use of UCS2 or UCS4 would be
> > a huge amount of work, and the benefit is at best questionable.
>
> Perhaps on a side note - my intuition (which sometimes lies) would tell
> me that, if the above is true, the backend is doing unnecessary copies
> of read-only data, if only, to insert a '\0' at the end of the strings.
> Is this true?

It's not quite that bad. Zero bytes are obviously allowed in all on-disk
datatypes. Bit strings, arrays, timestamps and numerics can all contain
embedded nulls, and the variable-length ones carry a length header
rather than relying on a terminator.
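
To make the difference concrete, here is a minimal sketch of a
length-prefixed value next to a null-terminated read of the same bytes.
The 4-byte little-endian header and the helper names are assumptions
made for illustration only; this is not the backend's actual on-disk
layout.

```python
import struct

HEADER = "<I"  # assumed 4-byte little-endian length header, illustration only


def pack_with_length(payload: bytes) -> bytes:
    """Prefix the payload with its length so no terminator is needed."""
    return struct.pack(HEADER, len(payload)) + payload


def unpack_with_length(buf: bytes) -> bytes:
    """Read back exactly as many bytes as the header says."""
    (n,) = struct.unpack_from(HEADER, buf, 0)
    start = struct.calcsize(HEADER)
    return buf[start:start + n]


def c_string_view(buf: bytes) -> bytes:
    """Mimic how C string functions see a buffer: stop at the first zero byte."""
    end = buf.find(b"\0")
    return buf if end < 0 else buf[:end]


payload = b"\x01\x00\x02\x00"          # binary data with embedded zero bytes
stored = pack_with_length(payload)

round_trip = unpack_with_length(stored)  # all four bytes survive
truncated = c_string_view(payload)       # only the first byte survives
```

The length header is what lets binary values carry zero bytes safely;
anything read as a C string loses everything after the first one.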

Where this becomes an issue is for things like table names, field
names, encoding names, etc. The "name" type is a fixed length string
which is kept in a way that it can be treated as a C string. If these
could contain null characters it would get messy.
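
As a rough sketch of why that gets messy, the snippet below models a
fixed-length, zero-padded name buffer that is read back the way C code
would read it. The 64-byte size and the function names are assumptions
for illustration, not a description of the actual "name"
implementation.

```python
NAME_LEN = 64  # assumed buffer size, illustration only


def store_name(text: bytes) -> bytes:
    """Store a name in a fixed-size buffer, zero-padded to NAME_LEN bytes."""
    if len(text) >= NAME_LEN:
        raise ValueError("name too long for the fixed-size buffer")
    return text.ljust(NAME_LEN, b"\0")


def read_name(buf: bytes) -> bytes:
    """Read the buffer as a C string: everything up to the first zero byte."""
    return buf.split(b"\0", 1)[0]


plain = read_name(store_name(b"my_table"))
# A name with an embedded zero byte is silently cut short when read back.
damaged = read_name(store_name(b"my\0table"))
```

The buffer itself always holds NAME_LEN bytes, but every reader that
treats it as a C string sees only the part before the first zero.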

I can conceive of the backend supporting a UTF-16 datatype which
would be indexable and have various support functions. But as soon as
it came to talking to clients, it would be converted back to UTF-8,
because libpq treats all strings coming back as null-terminated.
Similarly, queries sent couldn't be anything other than UTF-8.

And if users can't send or receive UTF-16 text, why should the backend
store it that way?
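
A small sketch of the underlying problem, using Python's standard
codecs: ASCII-range text encoded as UTF-16 contains zero bytes, while
the UTF-8 encoding of the same text does not. The c_string_view helper
is hypothetical and only mimics how a null-terminated reader would see
the bytes.

```python
def c_string_view(buf: bytes) -> bytes:
    """Mimic a null-terminated read: stop at the first zero byte."""
    end = buf.find(b"\0")
    return buf if end < 0 else buf[:end]


text = "abc"
as_utf8 = text.encode("utf-8")       # b"abc", no zero bytes
as_utf16 = text.encode("utf-16-le")  # b"a\x00b\x00c\x00"

utf8_seen = c_string_view(as_utf8)    # the whole string
utf16_seen = c_string_view(as_utf16)  # only the first byte
```

UTF-8 only produces a zero byte for the code point U+0000 itself, which
is why it fits a null-terminated interface and UTF-16 does not.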

> I'm thinking along the lines of the other threads that speak of PostgreSQL
> being CPU or I/O bound, not disk bound, for many sorts of operations. Is
> PostgreSQL unnecessary copying string data around (and other data, I would
> assume).

Well, there is a bit of copying around while creating tuples and such,
but it's not to add null terminators.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
