Re: A space-efficient, user-friendly way to store categorical data

From: Andrew Kane <andrew(at)chartkick(dot)com>
To: Mark Dilger <hornschnorter(at)gmail(dot)com>
Cc: Andrew Kane <andrew(at)chartkick(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: A space-efficient, user-friendly way to store categorical data
Date: 2018-02-13 01:08:09
Message-ID: CACDdp+YdkJCJR+niouXwxCjQgqXmzY0wfjPm_ZvaPPHf1uNopA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks everyone for the feedback. The current enum implementation requires
you to create a new type and add labels outside a transaction prior to an
insert.

-- on table creation
CREATE TYPE city AS ENUM ();
CREATE TABLE "users" ("city" city);

-- on insert
ALTER TYPE city ADD VALUE IF NOT EXISTS 'Chicago';
BEGIN;
INSERT INTO "users" ("city") VALUES ('Chicago');
COMMIT;

What would be ideal:

-- on table creation
CREATE TABLE "users" ("city" dynamic_enum);

-- on insert
BEGIN;
INSERT INTO "users" ("city") VALUES ('Chicago');
COMMIT;

Since enums have a fixed number of labels, this type of feature may be
better off as a property you could add to text columns (as Thomas
mentions). This would avoid issues with hitting the max number of labels.

- Andrew

On Mon, Feb 12, 2018 at 4:06 PM, Mark Dilger <hornschnorter(at)gmail(dot)com>
wrote:

>
> > On Feb 10, 2018, at 7:46 PM, Andrew Kane <andrew(at)chartkick(dot)com> wrote:
> >
> > Hi,
> >
> > I'm hoping to get feedback on an idea for a new data type to allow for
> efficient storage of text values while keeping reads and writes
> user-friendly. Suppose you want to store categorical data like current city
> for users. There will be a long list of cities, and many users will have
> the same city. Some options are:
> >
> > - Use a text column
> > - Use an enum column - saves space, but labels must be set ahead of time
> > - Create another table for cities (normalize) - saves space, but
> complicates reads and writes
> >
> > A better option could be a new "dynamic enum" type, which would have
> similar storage requirements as an enum, but instead of labels being
> declared ahead of time, they would be added as data is inserted.
> >
> > It'd be great to hear what others think of this (or if I'm missing
> something). Another direction could be to deduplicate values for TOAST-able
> data types.
> >
>
> I have already done this and use it in production. My implementation is
> specific to my needs, and would have to be made more general purpose if
> it were ever to be accepted by this community, but I'll lay out the
> basic design for your consideration.
>
> I maintain multiple mappings of text to/from Oid.
>
> Some of the names, typically the most common ones that I am likely to
> encounter, are known at compile time. I have a quick hardcoded lookup
> that maps these names to their compile time assigned Oid. Conversions
> from the name to Oid or visa versa happen without entailing any shared
> lock overhead.
>
> For each mapping, I maintain a separate catalog table. The compile time
> known values are inserted into the catalog with DATA lines, per the
> usual mechanism for populating catalog tables during bootstrap.
>
> For now, keeping the functions that map Oids to and from compile time
> known strings synchronized with the DATA lines is a manual PITA process,
> except that I almost never need to extend the mappings, and as such have
> not yet bothered to write a perl script for managing it.
>
> For each mapping, I also have a cache. See syscache.h. I have
> functions defined that convert between names and Oids that check the
> hardcoded values, then the cache, then the full catalog table if not
> found in the cache. It all works transactionally and plays well with
> other processes that are adding or looking up values simultaneously.
> There are more than just 'get' and 'set' functions defined, because
> sometimes when you 'get' a value, you want it to be assigned an entry if
> it does not exist yet, and sometimes you don't.
>
> For a few of the mappings, where I know the list will never get too
> large, I map from text to one or two bytes rather than to a full four
> byte Oid. This helps, because I have other tables with multiple columns
> that are mappings of this sort, such that I can pack two or three
> mapping columns into just 4 bytes. Whether you'd get a similar payoff
> depends a lot on your table structure. Since I'm using syscache, I have
> to store a full four bytes there, but the tables that reference that
> don't have to spend a full four bytes. To make the conversion easier, I
> have types smalloid (2 byte) and tinyoid (1 byte) defined, and an Oid
> range above 10000 that is reserved for their use, with a mapping to/from
> the lower range [0..255] or [0..65535] automatically performed when
> casting to/from Oid.
>
> The main reason I prefer this over the built-in enum type is that for
> the data distribution I am handling, I have greater than 90% of the
> lookups succeed without looking beyond the compile-time known values.
> (For some mappings, it's really more like 100%, with the possibility
> of run-time added values being a theoretical possibility that I don't
> want to exclude, but as yet has never happened in practice.)
>
> The primary drawback of my implementation from the point of view of
> making it general purpose is that you need to know something about your
> data distribution before you compile postgres. AFAIK, most postgres
> users don't compile their own copy, but use an RPM or some such, and
> that puts my type of solution out of reach for them.
>
> The second most obvious problem with my implementation is that it lacks
> any means for removing values from the mappings. Once added, they are
> never removed. For me, this is a good thing, because it means I don't
> need to create indexes referencing a master table of mappings to
> guarantee data integrity, and for my workload, none of the mappings are
> the kind of thing where mappings data would ever get expired.
>
> I am in a really small niche situation that would likely not be
> applicable for most other users.
>
> Extending this to help the community as a whole in the form of something
> general purpose seems difficult, which is why I never tried to
> contribute it.
>
> Could you perhaps tell me how similar your situation is to mine, and
> whether my approach is anything like what you were contemplating? I am
> curious to hear an answer to Tom's question, upthread, about why you
> would not just use the enumeration type that postgres already supplies.
>
> Mark Dilger
>
>
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2018-02-13 01:12:40 Re: non-bulk inserts and tuple routing
Previous Message Tom Lane 2018-02-13 00:08:50 Re: Using scalar function as set-returning: bug or feature?