RFD: ALTER COLUMN .. SET STORAGE COMPRESSED;

From: "Dawid Kuroczko" <qnex42(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: RFD: ALTER COLUMN .. SET STORAGE COMPRESSED;
Date: 2008-06-10 11:45:23
Message-ID: 758d5e7f0806100445j13e4768fy4342554890c14a6c@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello!

Currently the TOASTing code does its magic when whole tuple is
larger than TOAST_TUPLE_TARGET which happens to be around 2KB.

There are times though when one is willing to trade using (fast) CPU to
reduce amount of (slow) I/O. A data warehousing types of workload
most notably. Rarely used large columns which are likely to compress
well but are not large enough to trigger inline compression.

As we already have four types of ALTER COLUMN .. SET STORAGE
{ PLAIN | EXTERNAL | EXTENDED | MAIN } I would like to add
"COMPRESSED" which would force column compression (if column is
smaller than some minimun, I guess somwehwere between 16 and 32 bytes).

First of all, would such a feature be desirable? [1]

...as for implementation idea, so far I see it more or less like this:
* src/backend/access/common/heaptuple.c:
for tuples with COMPRESSED attributes, we set the infomask bit
HEAP_HASEXTERNAL, so that tuple will trigger TOAST regardless
of size.
* src/backend/access/heap/tuptoaster.c:
- add a bool "need_compress = false;" around line 425.
- while scanning the attributes (lines 472-575), mark the ones which
should be COMPRESSED
- if (need_compress), compress every marked column.
- perhaps refactor inline compression code (639-659) as a static
funcion shared with need_compress part above.

Does this sound reasonable?

PS: as a side note: I wonder if perhaps we could try compression erarlier,
at 1KB or event at 0.5KB, but leave TOASTing at 2KB limit)?

[1]: Actually some time ago I did write a system which stores tons of
real[0:59] (an hour's worth of every minute readings) data. Such column
takes approximately 246 bytes. For fun and experiment I did transform
the data into real[0:23][0:59] storing whole day's data. To my surprise
such column stores between 64 (!) and 5968. Also 66% of values were
taking less than 254 bytes (and 55% < 128 bytes)... And as the data
is much larger than RAM and read randomly, having it shrunk by more
than 25% is tempting. Hence the idea of SET STORAGE COMPRESSED.
I know such schema is flawed by design, but I guess there are other types
of data which would also see benefit from such an option.
--
Solving [site load issues] with [more database replication] is a lot
like solving your own personal problems with heroin - at first it
sorta works, but after a while things just get out of hand.

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Teodor Sigaev 2008-06-10 12:50:45 Re: Proposal: GiST constraints
Previous Message Teodor Sigaev 2008-06-10 11:38:21 Re: GIN improvements