Reducing data type space usage

From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Reducing data type space usage
Date: 2006-09-15 17:50:37
Message-ID: 87mz91ow7m.fsf@enterprisedb.com
Lists: pgsql-hackers

Following up on the recent discussion on list about wasted space in data
representations, I want to summarise what we found and make some proposals:

As I see it there are two cases:

Case 1) Data types that are variable length but often quite small. This includes
types like NUMERIC, which in common use will rarely be larger than 12-20
bytes, and often text as well.

In cases like these we really only need 1 or sometimes 2 bytes of varlena
header overhead, not the 4 we currently use. In fact we *never* need more
than 2 bytes of varlena header on disk anyway with the standard
configuration.
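A short header could be discriminated from a long one with a flag bit, along
these lines (a hypothetical sketch only; the bit layout and function names are
illustrative and not any actual PostgreSQL format):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Sketch: use the low bit of the first byte as a flag.  Flag 0 means the
 * remaining 7 bits hold the total length (1-byte header, datum up to 127
 * bytes).  Flag 1 means a full 4-byte length follows, shifted left to make
 * room for the flag.  Byte order is machine-native here, so the encode and
 * decode sides round-trip consistently. */
static size_t write_header(uint8_t *buf, uint32_t total_len)
{
    if (total_len <= 0x7F)
    {
        buf[0] = (uint8_t) (total_len << 1);    /* flag bit 0: short form */
        return 1;
    }
    uint32_t encoded = (total_len << 1) | 1;    /* flag bit 1: long form */
    memcpy(buf, &encoded, 4);
    return 4;
}

static uint32_t read_length(const uint8_t *buf, size_t *hdr_size)
{
    if ((buf[0] & 1) == 0)
    {
        *hdr_size = 1;
        return buf[0] >> 1;
    }
    uint32_t encoded;
    memcpy(&encoded, buf, 4);
    *hdr_size = 4;
    return encoded >> 1;
}
```

So a 20-byte datum pays only 1 byte of header while anything over 127 bytes
falls back to the 4-byte form transparently.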

Case 2) Data types whose size depends on the typmod but is fixed and statically
determinable for any given typmod. In an ASCII-encoded database CHAR(n)
fits this category, and in any case we'll eventually have per-column
encoding. NUMERIC(a,b) could be made to fit this as well.

In cases like these we don't need *any* varlena header, if we can arrange
for the functions to have enough information to know how large the data
must be.
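For a single-byte encoding the on-disk size of CHAR(n) falls straight out of
the typmod, since the stored typmod for CHAR(n) is n + VARHDRSZ (a sketch; the
function name is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: for a single-byte-encoded CHAR(n) column the on-disk size is
 * fully determined by the typmod, so no per-datum header is needed at
 * all.  PostgreSQL stores the typmod for CHAR(n) as n + VARHDRSZ; this
 * helper just inverts that convention. */
#define VARHDRSZ 4

static int char_disk_size(int32_t typmod)
{
    if (typmod < VARHDRSZ)
        return -1;              /* unconstrained CHAR: size not static */
    return typmod - VARHDRSZ;   /* n data bytes, zero header bytes */
}
```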

Solutions proposed:

Case 1) We've discussed the variable sized varlena headers and I think it's clear
that's the most realistic way to approach it.

I don't think any other approaches were even suggested. Tom said he wanted
a second varlena format for numeric that would have 2-byte alignment. But I
think we could always just say that we always use the 2-byte varlena header
on data types with 2-byte alignment and the 4-byte header on data types
with 4-byte alignment needs. Or heap_form_tuple could be even cleverer
about it but I'm not sure it's worth it.
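The alignment-based rule could be as simple as the following (hypothetical
sketch; the 's' and 'i' codes mirror pg_type.typalign, but the function itself
is illustrative, not PostgreSQL code):

```c
#include <assert.h>

/* Sketch of the rule above: choose the on-disk varlena header size from
 * the type's alignment requirement.  's' (short, 2-byte alignment) gets
 * a 2-byte header; 'i' (int, 4-byte alignment) keeps the 4-byte header. */
static int header_bytes_for_align(char typalign)
{
    return (typalign == 's') ? 2 : 4;
}
```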

This limits the wasted space to 1-2% for most variable sized data of 50
bytes or more. But for very small data, such as the quite common cases
where values are only 1-4 bytes, even a 1-byte header still means a
25-100% overhead.

Case 2) Solving this is quite difficult without introducing major performance
problems or security holes. The one approach we have that's practical right
now is introducing special data types such as the oft-mentioned "char" data
type. "char" doesn't have quite the right semantics to use as a transparent
substitute for CHAR but we could define a CHAR(1) with exactly the right
semantics and substitute it transparently in parser/analyze.c (btw having
two files named analyze.c is pretty annoying). We could do the same with
NUMERIC(a,b) for sufficiently small values of a and b with something like
D'Arcy's CASH data type (which uses an integer internally).
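A CASH-style NUMERIC(10,2) substitute could be sketched like this (hypothetical
names throughout; the point is just that a scaled integer of hundredths fits in
8 fixed bytes with no varlena header and exact arithmetic):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of a CASH-style substitute for NUMERIC(10,2): store the value
 * as a 64-bit count of hundredths.  8 fixed bytes replace a varlena
 * NUMERIC, and addition stays exact, with no floating-point rounding.
 * The type and helper names are illustrative, not an actual type. */
typedef int64_t numeric_10_2;

static numeric_10_2 n102_from_parts(int64_t whole, int64_t hundredths)
{
    return whole * 100 + hundredths;
}

static numeric_10_2 n102_add(numeric_10_2 a, numeric_10_2 b)
{
    return a + b;
}
```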

The problem with defining lots of data types is that the number of casts
and cross-data-type comparisons grows quadratically as the number of data
types grows. In theory we would save space by defining a CHAR(n) for
whatever size n the user needs but I can't really see anything other than
CHAR(1) being worthwhile. Similarly a 4-byte NUMERIC substitute like CASH
(with full NUMERIC semantics though) and maybe a 2-byte and 8-byte
substitute might be reasonable but anything else would be pointless.
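The combinatorial cost is easy to quantify (a sketch): with n mutually
comparable types there are n*(n-1) ordered cross-type pairs, so adding 2-, 4-,
and 8-byte NUMERIC substitutes alongside NUMERIC itself already means 12
cross-type pairings for every comparison operator.

```c
#include <assert.h>

/* Sketch of the quadratic growth mentioned above: ordered cross-type
 * comparison pairs among n mutually comparable types. */
static int cross_type_pairs(int n)
{
    return n * (n - 1);
}
```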

I see these two solutions as complementary. The variable varlena headers take
care of the larger data and the special-purpose data types take care of the
extremely small data. And it's pretty important to cover both cases: data that
fits in 1-4 bytes is quite common. You often see databases with dozens of CHAR(1)
flag columns or NUMERIC(10,2) currency columns.

With a CHAR(1) and a CASH-style numeric substitute we won't lose 25-100%
on the things that fit in 1-4 bytes. And with the variable sized varlena
header we'll limit the performance drain due to wasted space on larger data
to 25% at worst and usually 1-2%.
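As a back-of-the-envelope check of those numbers (a sketch; the column count is
illustrative):

```c
#include <assert.h>

/* Sketch: per-row attribute bytes for ncols columns of datalen data
 * bytes each, plus hdrlen bytes of per-datum header.  A row of 30
 * CHAR(1) flags costs 150 bytes with 4-byte varlena headers but only
 * 30 bytes with a headerless single-byte type. */
static int row_bytes(int ncols, int datalen, int hdrlen)
{
    return ncols * (datalen + hdrlen);
}
```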

Doing better would require a complete solution for data types that can
understand how large they are based on their typmod. That would imply more
dramatic solutions like the ones I mused about, involving passing around
structures that contain the Datum as well as the attlen or atttypmod. The more
I think about these ideas the more I think they may have merit, but they would
be awfully invasive and require more thought.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
