Re: A space-efficient, user-friendly way to store categorical data

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Kane <andrew(at)chartkick(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A space-efficient, user-friendly way to store categorical data
Date: 2018-02-12 06:06:08
Message-ID: CAEepm=0-A2hEE6sTYDHtsOQR9ykGnBzACPCNHkLE9-6x7hngAA@mail.gmail.com
Lists: pgsql-hackers

On Mon, Feb 12, 2018 at 12:24 PM, Andrew Dunstan
<andrew(dot)dunstan(at)2ndquadrant(dot)com> wrote:
> On Mon, Feb 12, 2018 at 9:10 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Andrew Kane <andrew(at)chartkick(dot)com> writes:
>>> A better option could be a new "dynamic enum" type, which would have
>>> similar storage requirements as an enum, but instead of labels being
>>> declared ahead of time, they would be added as data is inserted.
>>
>> You realize, of course, that it's possible to add labels to an enum type
>> today. (Removing them is another story.)
>>
>> You haven't explained exactly what you have in mind that is going to be
>> able to duplicate the advantages of the current enum implementation
>> without its disadvantages, so it's hard to evaluate this proposal.
>>
>
>
> This sounds rather like the idea I have been tossing around in my head
> for a while, and in sporadic discussions with a few people, for a
> dictionary object. The idea is to have an append-only list of labels
> which would not obey transactional semantics, and would thus help us
> avoid the pitfalls of enums - there wouldn't be any rollback of an
> addition. The use case would be for a jsonb representation which
> would replace object keys with the oid value of the corresponding
> dictionary entry rather like enums now. We could have a per-table
> dictionary which in most typical json use cases would be very small,
> and we know from some experimental data that the compression in space
> used from such a change would often be substantial.
>
> This would have to be modifiable dynamically rather than requiring
> explicit additions to the dictionary, to be of practical use for the
> jsonb case, I believe.
>
> I hadn't thought about this as a sort of super enum that was usable
> directly by users, but it makes sense.
>
> I have no idea how hard or even possible it would be to implement.

I have had thoughts over the years about something similar, but going
the other way and hiding it from the end user. If you could declare a
column to have a special compressed property (independently of the
type) then it could either automatically maintain a dictionary, or at
least build a new dictionary for you when you next run some kind of
COMPRESS operation. There would be no user visible difference except
footprint. In ancient DB2 they had a column property along those
lines called "VALUE COMPRESSION" (they also have a row-level version,
and now they have much more advanced kinds of adaptive compression
that I haven't kept up with). In some ways it'd be a bit like toast
with shared entries, but I haven't seriously looked into how such a
thing might be implemented.
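
To make the idea concrete, here is a minimal sketch of a
dictionary-compressed column in Python. It is an illustration only, not
PostgreSQL or DB2 code: the class name DictionaryColumn and its methods
are hypothetical, and it ignores concurrency, transactions and on-disk
layout entirely. It only shows the property described above: callers
write and read ordinary values, while the column stores small integer
codes plus one shared, append-only dictionary.

```python
class DictionaryColumn:
    """Hypothetical column that stores integer codes instead of values."""

    def __init__(self):
        self._code_for = {}   # value -> code, used when writing
        self._value_for = []  # code -> value, append-only, used when reading
        self._rows = []       # one code per stored row

    def append(self, value):
        """Store a value, adding it to the dictionary on first sight."""
        code = self._code_for.get(value)
        if code is None:
            code = len(self._value_for)
            self._code_for[value] = code
            self._value_for.append(value)
        self._rows.append(code)

    def __getitem__(self, index):
        """Return the original value; the code is never exposed."""
        return self._value_for[self._rows[index]]

    def __len__(self):
        return len(self._rows)

    def dictionary_size(self):
        """Number of distinct values seen so far."""
        return len(self._value_for)


if __name__ == "__main__":
    column = DictionaryColumn()
    for colour in ["red", "green", "red", "blue", "red"]:
        column.append(colour)
    print([column[i] for i in range(len(column))])
    print(column.dictionary_size(), "distinct values for", len(column), "rows")
```

The dictionary here is maintained automatically on every insert. The
alternative mentioned above, rebuilding the dictionary during an
explicit COMPRESS-style pass, would move the lookup-or-add step out of
append() and into a separate batch operation.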

--
Thomas Munro
http://www.enterprisedb.com
