Pie-in-sky dreaming about reworking tuple layout entirely

From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Pie-in-sky dreaming about reworking tuple layout entirely
Date: 2006-10-03 19:56:33
Message-ID: 87irj16umm.fsf@enterprisedb.com
Lists: pgsql-hackers


I can't shake the feeling that merely tweaking the way our varlenas work with
a shortvarlena or with compressed varlena headers is missing the real source
of our headaches. It seems very strange to me to be trying to step through a
tuple with length bits at the head of every field. It's a lot of work spent
dealing with a terribly inconvenient format when we can pick the format to be
whatever we like.

If we were designing this from scratch I don't think we would have picked this
approach at all. We would instead store offsets to every field whose offset
isn't static in the header so we can jump straight to a field without having
to look at all the previous fields at all.

I feel this would be far too invasive a change for me to contemplate currently
though. I don't know enough about what consequences it would have. I just
wanted to get the idea out there in case sometime down the line, perhaps even
not for several releases, I or someone else might be prepared to tackle such a
project. It might make a lot of things cleaner and easier, such as dealing with
user-defined fixed-length fields whose length depends on the typmod.

What I picture is basically an array of offsets into the tuple. A
field would be the data from its offset to the next. We would have to store an
offset to the end of the tuple as well, and probably another field with the
length of the array so we can find the end quickly without counting bits in
the null bitmap.
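
To make that a bit more concrete, here's a rough sketch in C. All the names
and field widths are made up purely for illustration; this isn't meant to
match any real backend structure:

#include <stddef.h>
#include <stdint.h>

/* Illustrative only: a tuple header carrying an array of offsets */
typedef struct SketchTupleHeader
{
    uint16_t natts;      /* number of attributes / offset-array entries */
    uint16_t offsets[];  /* natts + 1 entries; offsets[i]..offsets[i+1]
                          * bound attribute i, and the extra final entry
                          * marks the end of the tuple, so we can find it
                          * without counting bits in the null bitmap */
} SketchTupleHeader;

/* Attribute i is simply the bytes between two consecutive offsets */
static inline size_t
sketch_att_size(const SketchTupleHeader *tup, int i)
{
    return tup->offsets[i + 1] - tup->offsets[i];
}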

I also picture a kind of compression that stores offsets modulo 256 as 1-byte
offsets with a separate list of where the 256-byte boundaries are. That means
we would probably still need a heap_deform_tuple that steps through the list
of offsets, null bitmap, and 256-boundary list, and builds an in-memory list of
pointers to the fields, with nulls for the null values.
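
For example, rebuilding absolute offsets from those 1-byte values might look
roughly like the sketch below. The representation of the boundary list (an
array of the attribute numbers at which each 256-byte boundary is crossed) is
just an assumption made for the sake of illustration:

#include <stdint.h>

/*
 * Illustrative only: reconstruct full offsets from 1-byte offsets stored
 * modulo 256 plus a list of the attribute numbers at which 256-byte
 * boundaries are crossed.
 */
static void
sketch_decode_offsets(const uint8_t *small_offs, int natts,
                      const uint16_t *boundary_atts, int nboundaries,
                      uint32_t *full_offs)
{
    uint32_t base = 0;   /* 256 * number of boundaries crossed so far */
    int      b = 0;

    for (int i = 0; i < natts; i++)
    {
        /* every boundary crossed at or before attribute i adds another 256 */
        while (b < nboundaries && boundary_atts[b] <= i)
        {
            base += 256;
            b++;
        }
        full_offs[i] = base + small_offs[i];
    }
}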

Actually, for the in-memory storage, instead of a simple array of pointers I
picture an array of structs where each struct contains a pointer to the field,
a 4-byte length value, and the typmod (for a total of 12 bytes on a 32-bit
platform and 16 bytes on a 64-bit platform). Then instead of passing a Datum
directly to functions we would pass one of these structs (possibly by
reference?).

That means no function would ever have to store the length of a field internal
to the field itself, and it could always depend on having the typmod
available.
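
In rough C terms, something like the following, where the names and exact
layout are again just assumptions for illustration:

#include <stdint.h>

/* Illustrative only: in-memory representation of one field */
typedef struct SketchFieldRef
{
    const void *data;    /* pointer into the deformed tuple, NULL if null */
    int32_t     len;     /* field length; nothing stored inside the field */
    int32_t     typmod;  /* always available to the called function */
} SketchFieldRef;

/* A function receiving one of these never needs a length header in the data */
static int32_t
sketch_octet_length(const SketchFieldRef *f)
{
    return f->len;
}

That's 12 bytes on a 32-bit platform and 16 bytes on a 64-bit one, matching
the numbers above.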

In real terms this actually doesn't reduce space usage; in fact it probably
increases it. It's effectively storing a 1-byte header for every field, even
for something like an integer.

But you could imagine some further refinements. If you store all the
non-varlena fields first then you can just start the list of offsets at the
first varlena. heap_deform_tuple could step through the tuple descriptor and
null bitmap first, setting up the offsets to any field with a fixed attlen
without the help of any stored offsets.
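
A deforming loop under that refinement might look something like this sketch,
which ignores alignment, assumes the fixed-length attributes really are stored
physically first, and uses invented names throughout:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative only: fixed-length fields are located from their declared
 * lengths (attlen) and the null bitmap alone; variable-length fields are
 * located from the stored offset array. Attributes are visited in physical
 * order, fixed-length ones first.
 */
static void
sketch_deform(const uint8_t *tupdata, const int16_t *attlen, int natts,
              const bool *isnull, const uint32_t *var_offsets,
              const void **values)
{
    size_t pos = 0;      /* running position in the fixed-length region */
    int    varidx = 0;   /* next entry in the stored offset array */

    for (int i = 0; i < natts; i++)
    {
        if (isnull[i])
        {
            values[i] = NULL;
            continue;
        }
        if (attlen[i] > 0)
        {
            /* fixed-length: no stored offset needed */
            values[i] = tupdata + pos;
            pos += attlen[i];
        }
        else
        {
            /* variable-length: jump straight to the stored offset */
            values[i] = tupdata + var_offsets[varidx++];
        }
    }
}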

(Incidentally, to do that there's a tricky bit. You have to be able to put new
fixed-length fields before varlena fields even if they're added later. But
that means you need to know they're missing on the old records, so you can't
insert them out of order in the null bitmap. So the null bitmap would have to
be in the order the fields were added, even if that's not the order they
appear physically or the order they appear logically. Even more confusing than
the original plan that got shot down as too confusing already.)

The main inefficiency in this is that it would mean passing 3x as much data
(only 2x as much on 64-bit platforms) to functions. But given the way functions
are already called with a pointer to 100-element argument arrays, it doesn't
seem like that should be a big concern.

The upside is that data types could use different size storage for different
typmods without having to define whole new types based on internal
implementation details the user doesn't know about. NUMERIC could internally
use integer for some typmods and just perform integer arithmetic. It wouldn't
have to reserve any bits to indicate the scale or precision of the number
since the typmod is available whenever it's processing an element from the
table and it's passed around with the field by the system. There's no security
concern because the system is responsible for passing the typmod, not the
caller. Only C code would be able to construct arbitrary typmods; SQL or
PL/pgSQL would have to use casts to safely set the typmod on their output.
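
As a toy illustration of that point (the typmod encoding and all the names
here are invented, not what NUMERIC actually does):

#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative only: decide, from the typmod the system passes along with
 * the value, whether a NUMERIC-like field can be treated as a plain 64-bit
 * integer. No scale or precision bits need to be reserved in the value.
 */
static bool
sketch_numeric_is_plain_int(int32_t typmod, int32_t len)
{
    int32_t scale = typmod & 0xffff;   /* invented typmod encoding */

    return scale == 0 && len == (int32_t) sizeof(int64_t);
}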

One advantage of passing around this meta-information automatically is that it
obviates the need for many core data types whose main advantage is that they
use more efficient storage than is possible with domains or user-defined
types. The recently discussed UUID type, for example, offers nothing that
couldn't be implemented with a user-defined type using bytea storage
internally -- except for the fact that it doesn't store an extra 4 bytes of
length header in every value.

Another advantage is that we wouldn't need a whole slew of functions in the
catalogs that do the same thing slightly differently depending on the input
type. Currently each one has to be declared separately since there's no way to
know what type of datum they've been passed.

I suspect something like this will be necessary when it comes to implementing
per-column encoding anyway. Unless we adopt some sort of universal encoding
as the "preferred" database encoding and always convert to and from it, we'll
need to pass around the encoding of any text datum along with the datum too.

Well, I'm happier having gotten these ideas down, even if, as I said, I have
no plans to implement them in the near future. It would mean totally rewriting
some pretty hairy internal parts that I'm only starting to have a basic
understanding of.

I am planning to experiment with the shortvarlena form for shortvarchar et al.,
but I hope that's something we only use for a few versions. I still hope we
move to something like the above someday instead.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
