Re: Fixed length data types issue

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <gsstark(at)mit(dot)edu>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject: Re: Fixed length data types issue
Date: 2006-09-11 01:16:51
Message-ID: 6043.1157937411@sss.pgh.pa.us
Lists: pgsql-hackers

Gregory Stark <gsstark(at)mit(dot)edu> writes:
> I'm a bit confused by this and how it would be handled in your sketch. I
> assumed we needed a bit pattern dedicated to 4-byte length headers because
> even though it would never occur on disk it would be necessary for the
> uncompressed and/or detoasted data.

> In your scheme what would PG_GETARG_TEXT() give you if the data was detoasted
> to larger than 16k?

I'm imagining that it would give you the same old uncompressed in-memory
representation as it does now, ie, 4-byte length word and uncompressed
data.

The weak spot of the scheme is that it assumes different, incompatible
in-memory and on-disk representations. This seems to require either
(a) coercing values to in-memory form before they ever get handed to any
datatype manipulation function, or (b) thinking of some magic way to
pass out-of-band info about the contents of the datum. (b) is the same
stumbling block we have in connection with making typmod available to
datatype manipulation functions. I don't want to reject (b) entirely,
but it seems to require some pretty major structural changes.

OTOH (a) is not very pleasant either, and so what would be nice is if
we could tell by inspection of the Datum alone which format it's in.

After further thought I have an alternate proposal that does that,
but it's got its own disadvantage: it requires storing uncompressed
4-byte length words in big-endian byte order everywhere. This might
be a showstopper (does anyone know the cost of ntohl() on modern
Intel CPUs?), but if it's not then I see things working like this:

* If high order bit of datum's first byte is 0, then it's an
uncompressed datum in what's essentially the same as our current
in-memory format except that the 4-byte length word must be big-endian
(to ensure that the leading bit can be kept zero). In particular this
format will be aligned on 4- or 8-byte boundary as called for by the
datatype definition.

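* If high order bit of first byte is 1, then it's some compressed
variant (compressed in the sense of having a shortened header; the
data itself may or may not be compressed). I'd propose divvying up
the code space like this: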

    * 0xxxxxxx   uncompressed 4-byte length word as stated above
    * 10xxxxxx   1-byte length word, up to 62 bytes of data
    * 110xxxxx   2-byte length word, uncompressed inline data
    * 1110xxxx   2-byte length word, compressed inline data
    * 1111xxxx   1-byte length word, out-of-line TOAST pointer

This limits us to 8K uncompressed or 4K compressed inline data without
toasting, which is slightly annoying but probably still an insignificant
limitation. It also means more distinct cases for the heap_deform_tuple
inner loop to think about, which might be a problem.
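
To make the code space above concrete, here is a minimal sketch in C of how the first byte could be classified and why the 4-byte length word has to be big-endian. All names (HeaderKind, classify_header, length_bits, length_field_capacity, store_length_be) are hypothetical and are not existing PostgreSQL macros or functions. The sketch only counts the bits left over after the tag bits; it does not settle whether the stored length includes the header itself.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the five header forms in the proposal. */
typedef enum {
    HDR_4B_UNCOMPRESSED,   /* 0xxxxxxx */
    HDR_1B_SHORT,          /* 10xxxxxx */
    HDR_2B_UNCOMPRESSED,   /* 110xxxxx */
    HDR_2B_COMPRESSED,     /* 1110xxxx */
    HDR_1B_TOAST_POINTER   /* 1111xxxx */
} HeaderKind;

/* Decide the header form from the leading bits of the first byte. */
static HeaderKind classify_header(uint8_t first)
{
    if ((first & 0x80) == 0x00) return HDR_4B_UNCOMPRESSED;
    if ((first & 0xC0) == 0x80) return HDR_1B_SHORT;
    if ((first & 0xE0) == 0xC0) return HDR_2B_UNCOMPRESSED;
    if ((first & 0xF0) == 0xE0) return HDR_2B_COMPRESSED;
    return HDR_1B_TOAST_POINTER;
}

/* Bits available for the length once the tag bits are taken out. */
static unsigned length_bits(HeaderKind kind)
{
    switch (kind) {
        case HDR_4B_UNCOMPRESSED:  return 31;
        case HDR_1B_SHORT:         return 6;
        case HDR_2B_UNCOMPRESSED:  return 13;  /* 2^13 = 8K */
        case HDR_2B_COMPRESSED:    return 12;  /* 2^12 = 4K */
        case HDR_1B_TOAST_POINTER: return 4;
    }
    return 0;
}

/* Number of distinct values the length field can hold. */
static uint32_t length_field_capacity(HeaderKind kind)
{
    unsigned bits = length_bits(kind);
    assert(bits > 0 && bits < 32);
    return (uint32_t)1 << bits;
}

/* Store a 4-byte length most significant byte first, so that the
 * first byte has its high bit clear for any length below 2^31. */
static void store_length_be(uint8_t out[4], uint32_t len)
{
    assert((len & 0x80000000u) == 0);
    out[0] = (uint8_t)(len >> 24);
    out[1] = (uint8_t)(len >> 16);
    out[2] = (uint8_t)(len >> 8);
    out[3] = (uint8_t)len;
}
```

With a little-endian length word, a length of 200 (0xC8) would put 0xC8 in the first byte, which the classifier above would read as a 110xxxxx header. Storing the word big-endian puts 0x00 there instead.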

Since the compressed forms would not be aligned to any boundary,
there's an important special case here: how can heap_deform_tuple tell
whether the next field is compressed or not? The answer is that we'll
have to require pad bytes between fields to be zero. (They already are
zeroed by heap_form_tuple, but now it'd be a requirement.) So the
algorithm for decoding a non-null field is:

* if looking at a byte with high bit 0, then we are either
on the start of an uncompressed field, or on a pad byte before
such a field. Advance to the declared alignment boundary for
the datatype, read a 4-byte length word, and proceed.

* if looking at a byte with high bit 1, then we are at the
start of a compressed field (which will never have any preceding
pad bytes). Decode length as per rules above.
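
The two decoding rules can be sketched in C roughly as follows. This is an illustration under stated assumptions, not heap_deform_tuple itself: FieldHeader, align_up and decode_field_header are hypothetical names, offsets are taken relative to a buffer whose start is assumed to be suitably aligned, and the 2-byte forms are assumed to keep their high-order length bits in the first byte next to the tag bits. The function returns the raw length field without interpreting whether it counts the header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: where the header starts, how many bytes
 * it occupies, and the raw value of its length field. */
typedef struct {
    size_t   start;
    size_t   header_len;
    uint32_t length;
} FieldHeader;

/* Round off up to the next multiple of align (a power of two). */
static size_t align_up(size_t off, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (off + align - 1) & ~(align - 1);
}

/* Decode the header of a non-null field starting at or after off.
 * typalign is the declared alignment of the datatype (e.g. 4 or 8). */
static FieldHeader decode_field_header(const uint8_t *data, size_t off,
                                       size_t typalign)
{
    FieldHeader h;
    uint8_t b = data[off];

    if ((b & 0x80) == 0) {
        /* High bit 0: either a zero pad byte or the first byte of a
         * big-endian 4-byte length word. Either way, skip to the
         * alignment boundary and read the word there. */
        off = align_up(off, typalign);
        h.start = off;
        h.header_len = 4;
        h.length = ((uint32_t)data[off] << 24) |
                   ((uint32_t)data[off + 1] << 16) |
                   ((uint32_t)data[off + 2] << 8) |
                    (uint32_t)data[off + 3];
        return h;
    }

    /* High bit 1: a compressed header, never preceded by padding. */
    h.start = off;
    if ((b & 0xC0) == 0x80) {            /* 10xxxxxx */
        h.header_len = 1;
        h.length = b & 0x3F;
    } else if ((b & 0xE0) == 0xC0) {     /* 110xxxxx */
        h.header_len = 2;
        h.length = ((uint32_t)(b & 0x1F) << 8) | data[off + 1];
    } else if ((b & 0xF0) == 0xE0) {     /* 1110xxxx */
        h.header_len = 2;
        h.length = ((uint32_t)(b & 0x0F) << 8) | data[off + 1];
    } else {                             /* 1111xxxx */
        h.header_len = 1;
        h.length = b & 0x0F;
    }
    return h;
}
```

Note that the first branch only works because pad bytes are required to be zero: a nonzero pad byte with its high bit set would be mistaken for the start of a compressed field.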

The good thing about this approach is that it requires zero changes to
fundamental system structure. The pack/unpack rules in heap_form_tuple
and heap_deform_tuple change a bit, and the mechanics of
PG_DETOAST_DATUM change, but a Datum is still just a pointer and you
can always tell what you've got by examining the pointed-to data.

regards, tom lane
