Re: A space-efficient, user-friendly way to store categorical data

From: Andrew Kane <andrew(at)chartkick(dot)com>
To: Mark Dilger <hornschnorter(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Kane <andrew(at)chartkick(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A space-efficient, user-friendly way to store categorical data
Date: 2018-02-13 03:57:24
Message-ID: CACDdp+YG_NP9isQ9Ey2KPb_sKoRH+MUj51kJzaGm6rTocKTUcg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

They'd refer to separate enums.

I originally thought an enum was a good comparison for this feature, but
I'm no longer sure that it is. A text-based ordering would be desired
rather than the label index.

A better comparison may be a two-column lookup table:

-- create
CREATE TABLE cities (id bigserial primary key, name text)
CREATE UNIQUE INDEX ON cities (name);
CREATE TABLE users (city_id bigint);

-- insert
BEGIN;
INSERT INTO cities (name) VALUES ('Chicago') ON CONFLICT (name) DO NOTHING
RETURNING id;
INSERT INTO users (city_id) VALUES (<city id returned from earlier>);
COMMIT;

-- select
SELECT * FROM users FROM users INNER JOIN cities ON cities.id =
users.city_id WHERE name = 'Chicago';

Ideally, the lookup table could be maintained by Postgres to make reads and
writes easier.

-- create
CREATE TABLE users (city text DEDUPED);

-- insert
INSERT INTO users (city) VALUES ('Chicago');

-- query
SELECT * FROM users WHERE city = 'Chicago';

I'm not really sure the best place to store this lookup table.

- Andrew

On Mon, Feb 12, 2018 at 7:11 PM, Mark Dilger <hornschnorter(at)gmail(dot)com>
wrote:

>
> > On Feb 12, 2018, at 6:35 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >
> > Andrew Kane <andrew(at)chartkick(dot)com> writes:
> >> Thanks everyone for the feedback. The current enum implementation
> requires
> >> you to create a new type and add labels outside a transaction prior to
> an
> >> insert.
> >
> > Right ...
> >
> >> Since enums have a fixed number of labels, this type of feature may be
> >> better off as a property you could add to text columns (as Thomas
> >> mentions). This would avoid issues with hitting the max number of
> labels.
> >
> > ... but you're not saying how you'd avoid the need for prior commit of
> the
> > labels. The sticking point for enums is that once a value has gotten
> into
> > a btree index, we can't ever lose the ability to compare that value to
> > others, or the index will be broken. So inserting an uncommitted value
> > into user tables has to be prevented.
> >
> > Maybe there's a way to assign the labels so that they can be compared
> > without reference to any outside data, but it's not clear to me how
> > that would work.
>
> When I implemented this, I wrote the comparators to work on the Oid for
> the value, not the string representation. That works fine. If you want to
> sort the data on the stringified version, cast to text first. That works
> well
> enough for me, since I'm typically not interested in what sort order is
> used,
> as long as it is deterministic and works for indexing, group by, and so
> forth.
>
>

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tatsuo Ishii 2018-02-13 03:58:22 Re: rename sgml files?
Previous Message Michael Paquier 2018-02-13 03:52:38 Re: rename sgml files?