Variable length array element encoding…

From: Sean Chittenden <sean(at)chittenden(dot)org>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Variable length array element encoding…
Date: 2012-11-13 19:57:55
Message-ID: 0938071E-E4B8-4466-BDD5-AEB2FAB40397@chittenden.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

[ Not subscribed, please keep me in the CC list ]

Is there a standard idiom for encoding small variable length data in an array? I wrote the varint extension[1] that encodes data using a variable width encoding scheme[2] for signed and unsigned integers[3]. Right now the extension is mostly of use in skinny tables that have at least 4-5 columns, all of which are of INT or INT8. If you have only 5 columns of INT8, you can save ~50% of your table space.

But, to get larger savings, it's required to bypass the tuple overhead and aggregating data in to an array (i.e. aggregate all time series data for a 5min window of time in to a single varuint[]).

The problem with that being, each varint takes 8 bytes in an array because of padding and alignment. Is there a way to prevent that, or, more realistically, are there standard ways of encoding this data in to a BYTEA and then manually scanning and unpacking the data? Random access in to the array isn't a concern. I was thinking about adding a BYTEA to varint[] cast, but am fishing for a better idea.

Any hints or thoughts? Thanks in advance. -sc

[1] https://github.com/sean-/postgresql-varint

[2] SELECT varint64, pg_column_size(varint64) FROM varint64_table ORDER BY varint64 ASC;
varint64 | pg_column_size
----------------------+-----------------
-4611686018427387905 | 11
-4611686018427387904 | 10
-36028797018963969 | 10
-36028797018963968 | 9
-281474976710657 | 9
-281474976710656 | 8
-2199023255553 | 8
-2199023255552 | 7
-17179869185 | 7
-17179869184 | 6
-134217729 | 6
-134217728 | 5
-1048577 | 5
-1048576 | 4
-8193 | 4
-8192 | 3
-65 | 3
-64 | 2
-1 | 2
0 | 2
1 | 2
63 | 2
64 | 3
8191 | 3
8192 | 4
1048575 | 4
1048576 | 5
134217727 | 5
134217728 | 6
17179869183 | 6
17179869184 | 7
2199023255551 | 7
2199023255552 | 8
281474976710655 | 8
281474976710656 | 9
36028797018963967 | 9
36028797018963968 | 10
4611686018427387903 | 10
4611686018427387904 | 11
(39 rows)

SELECT varuint64, pg_column_size(varint64) FROM varuint64_table ORDER BY varint64 ASC;
varuint64 | pg_column_size
---------------------+-----------------
0 | 2
127 | 2
128 | 3
16383 | 3
16384 | 4
2097151 | 4
2097152 | 5
268435455 | 5
268435456 | 6
34359738367 | 6
34359738368 | 7
4398046511103 | 7
4398046511104 | 8
562949953421311 | 8
562949953421312 | 9
72057594037927935 | 9
72057594037927936 | 10
9223372036854775807 | 10

[3] I know the unsigned int only goes up to 2^^63 atm, it will go to 2^^64 once I get around to setting up a test methodology. Using INT8 internally was too convenient at the time.

--
Sean Chittenden
sean(at)chittenden(dot)org

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2012-11-13 20:11:02 Re: Memory leaks in record_out and record_send
Previous Message Simon Riggs 2012-11-13 19:45:14 Re: Proof of concept: standalone backend with full FE/BE protocol