Re: A space-efficient, user-friendly way to store categorical data

From: Mark Dilger <hornschnorter(at)gmail(dot)com>
To: Andrew Kane <andrew(at)chartkick(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: A space-efficient, user-friendly way to store categorical data
Date: 2018-02-13 00:06:47
Message-ID: 90D6F935-C507-4146-A72D-8408CD06999D@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


> On Feb 10, 2018, at 7:46 PM, Andrew Kane <andrew(at)chartkick(dot)com> wrote:
>
> Hi,
>
> I'm hoping to get feedback on an idea for a new data type to allow for efficient storage of text values while keeping reads and writes user-friendly. Suppose you want to store categorical data like current city for users. There will be a long list of cities, and many users will have the same city. Some options are:
>
> - Use a text column
> - Use an enum column - saves space, but labels must be set ahead of time
> - Create another table for cities (normalize) - saves space, but complicates reads and writes
>
> A better option could be a new "dynamic enum" type, which would have similar storage requirements as an enum, but instead of labels being declared ahead of time, they would be added as data is inserted.
>
> It'd be great to hear what others think of this (or if I'm missing something). Another direction could be to deduplicate values for TOAST-able data types.
>

I have already done this and use it in production. My implementation is
specific to my needs, and would have to be made more general purpose if
it were ever to be accepted by this community, but I'll lay out the
basic design for your consideration.

I maintain multiple mappings of text to/from Oid.

Some of the names, typically the most common ones that I am likely to
encounter, are known at compile time. I have a quick hardcoded lookup
that maps these names to their compile time assigned Oid. Conversions
from the name to Oid or visa versa happen without entailing any shared
lock overhead.

For each mapping, I maintain a separate catalog table. The compile time
known values are inserted into the catalog with DATA lines, per the
usual mechanism for populating catalog tables during bootstrap.

For now, keeping the functions that map Oids to and from compile time
known strings synchronized with the DATA lines is a manual PITA process,
except that I almost never need to extend the mappings, and as such have
not yet bothered to write a perl script for managing it.

For each mapping, I also have a cache. See syscache.h. I have
functions defined that convert between names and Oids that check the
hardcoded values, then the cache, then the full catalog table if not
found in the cache. It all works transactionally and plays well with
other processes that are adding or looking up values simultaneously.
There are more than just 'get' and 'set' functions defined, because
sometimes when you 'get' a value, you want it to be assigned an entry if
it does not exist yet, and sometimes you don't.

For a few of the mappings, where I know the list will never get too
large, I map from text to one or two bytes rather than to a full four
byte Oid. This helps, because I have other tables with multiple columns
that are mappings of this sort, such that I can pack two or three
mapping columns into just 4 bytes. Whether you'd get a similar payoff
depends a lot on your table structure. Since I'm using syscache, I have
to store a full four bytes there, but the tables that reference that
don't have to spend a full four bytes. To make the conversion easier, I
have types smalloid (2 byte) and tinyoid (1 byte) defined, and an Oid
range above 10000 that is reserved for their use, with a mapping to/from
the lower range [0..255] or [0..65535] automatically performed when
casting to/from Oid.

The main reason I prefer this over the built-in enum type is that for
the data distribution I am handling, I have greater than 90% of the
lookups succeed without looking beyond the compile-time known values.
(For some mappings, it's really more like 100%, with the possibility
of run-time added values being a theoretical possibility that I don't
want to exclude, but as yet has never happened in practice.)

The primary drawback of my implementation from the point of view of
making it general purpose is that you need to know something about your
data distribution before you compile postgres. AFAIK, most postgres
users don't compile their own copy, but use an RPM or some such, and
that puts my type of solution out of reach for them.

The second most obvious problem with my implementation is that it lacks
any means for removing values from the mappings. Once added, they are
never removed. For me, this is a good thing, because it means I don't
need to create indexes referencing a master table of mappings to
guarantee data integrity, and for my workload, none of the mappings are
the kind of thing where mappings data would ever get expired.

I am in a really small niche situation that would likely not be
applicable for most other users.

Extending this to help the community as a whole in the form of something
general purpose seems difficult, which is why I never tried to
contribute it.

Could you perhaps tell me how similar your situation is to mine, and
whether my approach is anything like what you were contemplating? I am
curious to hear an answer to Tom's question, upthread, about why you
would not just use the enumeration type that postgres already supplies.

Mark Dilger

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2018-02-13 00:08:50 Re: Using scalar function as set-returning: bug or feature?
Previous Message Peter Eisentraut 2018-02-12 23:55:11 Re: In logical replication concurrent update of partition key creates a duplicate record on standby.