Re: Fixed length data types issue

From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Gregory Stark <gsstark(at)mit(dot)edu>, Bruce Momjian <bruce(at)momjian(dot)us>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject: Re: Fixed length data types issue
Date: 2006-09-11 10:03:03
Message-ID: 87fyeyyb3c.fsf@enterprisedb.com
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Gregory Stark <gsstark(at)mit(dot)edu> writes:
>> I'm a bit confused by this and how it would be handled in your sketch. I
>> assumed we needed a bit pattern dedicated to 4-byte length headers because
>> even though it would never occur on disk it would be necessary for the
>> uncompressed and/or detoasted data.
>
>> In your scheme what would PG_GETARG_TEXT() give you if the data was detoasted
>> to larger than 16k?
>
> I'm imagining that it would give you the same old uncompressed in-memory
> representation as it does now, ie, 4-byte length word and uncompressed
> data.

Sure, but how would you know? Sometimes you would get a pointer to a varlena
starting with a byte whose leading 00 indicates a 1-byte varlena header, and
sometimes you would get a pointer to a varlena in the old uncompressed
representation with a 4-byte length header, which may well start with a 00.
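
To make the ambiguity concrete, here is a toy illustration (entirely my own
sketch, not anything in the tree): a detoasted datum that still carries the
traditional 4-byte length word can begin with exactly the byte pattern that is
supposed to announce a 1-byte header.

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    uint32_t      len = 70000;    /* 4-byte length word of a detoasted datum */
    unsigned char buf[4];

    /* lay the length word out big-endian, as if it led the datum in memory */
    buf[0] = (len >> 24) & 0xFF;
    buf[1] = (len >> 16) & 0xFF;
    buf[2] = (len >> 8) & 0xFF;
    buf[3] = len & 0xFF;

    /* 70000 = 0x00011170, so the first byte matches 00xxxxxx */
    if ((buf[0] & 0xC0) == 0x00)
        printf("0x%02X looks like a 1-byte header, but it is really the top "
               "byte of a 4-byte length word\n", buf[0]);
    return 0;
}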

> * If high order bit of first byte is 1, then it's some compressed
> variant. I'd propose divvying up the code space like this:
>
> * 0xxxxxxx uncompressed 4-byte length word as stated above
> * 10xxxxxx 1-byte length word, up to 62 bytes of data
> * 110xxxxx 2-byte length word, uncompressed inline data
> * 1110xxxx 2-byte length word, compressed inline data
> * 1111xxxx 1-byte length word, out-of-line TOAST pointer

I'm unclear on how you're using the remaining bits. Are you saying there would
be a full 4-byte length word following this bit-flag byte? Or are you saying we
would pack the length into the bits left over after the tag: 31 bits for the
4-byte length word, 13 bits for the 2-byte uncompressed length word, and 12
bits for the compressed length word?
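
If it is the second reading, I imagine the decoding would look roughly like
this (hypothetical macros and byte layout of my own, big-endian assumed, just
to check I have the bit counts right):

#include <stdint.h>

#define IS_2B_UNCOMPRESSED(b)   (((b) & 0xE0) == 0xC0)    /* 110xxxxx */
#define IS_2B_COMPRESSED(b)     (((b) & 0xF0) == 0xE0)    /* 1110xxxx */

/* 13 usable length bits: 5 from the tag byte, 8 from the next byte */
static inline uint32_t
size_2b_uncompressed(const unsigned char *p)
{
    return ((uint32_t) (p[0] & 0x1F) << 8) | p[1];        /* max 8191 */
}

/* 12 usable length bits: 4 from the tag byte, 8 from the next byte */
static inline uint32_t
size_2b_compressed(const unsigned char *p)
{
    return ((uint32_t) (p[0] & 0x0F) << 8) | p[1];        /* max 4095 */
}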

Also, Heikki points out here that it would be nice to allow for the case of a
0-byte header. So, for example, if the leading bit is 0 then the remaining 7
bits are available for the datum itself. This would actually vacate much of my
argument for a fixed-length char(n) data type, since the most frequent use
case is for things like CHAR(1) fields containing 'Y' or 'N'.
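
As I understand the idea, it would come down to something like this (again a
sketch of mine, hypothetical encoding, not Heikki's wording):

#include <stdbool.h>

/* 0xxxxxxx: no length header at all, the byte is the datum */
static inline bool
is_zero_header_datum(unsigned char b)
{
    return (b & 0x80) == 0;
}

static inline unsigned char
zero_header_value(unsigned char b)
{
    return b & 0x7F;              /* 'Y' (0x59) and 'N' (0x4E) both fit */
}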

In any case the allocation seems a bit backwards to me. Wouldn't it be better
to preserve bits in the short length words, where they're precious, rather
than in the long ones? If we make 0xxxxxxx the 1-byte case it means limiting
our maximum datum size to something like .5G, but if you're working with .5G
of data wouldn't you be using an API that lets you access it in chunks anyway?
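
Just to put rough numbers on that (they are mine, not from your proposal): a
7-bit inline length for the short case, and a long form along these lines,
which still leaves about 29-30 bits, i.e. a ceiling somewhere around .5G-1G.

#include <stdint.h>

#define IS_SHORT(b)        (((b) & 0x80) == 0)            /* 0xxxxxxx */
#define SHORT_LENGTH(b)    ((uint32_t) ((b) & 0x7F))      /* up to 127 bytes */

/* long form, e.g. 10xxxxxx plus three more bytes: 30 bits, roughly 1G;
 * spend one more tag bit elsewhere and the limit drops to about .5G */
static inline uint32_t
long_length(const unsigned char *p)
{
    return ((uint32_t) (p[0] & 0x3F) << 24) |
           ((uint32_t) p[1] << 16) |
           ((uint32_t) p[2] << 8) |
            (uint32_t) p[3];
}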

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
