Re: Questions about btree_gin vs btree_gist for low cardinality columns

From: Morris de Oryx <morrisdeoryx(at)gmail(dot)com>
To: Jeremy Finzel <finzelj(at)gmail(dot)com>
Cc: Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Questions about btree_gin vs btree_gist for low cardinality columns
Date: 2019-06-01 02:52:29
Message-ID: CAKqnccjrdoNne5D48-REhgNgRghc_2o453x1SWEoa8EYh792EQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Jeremy's question is *great*, and really well presented. I can't answer his
questions, but I am keenly interested in this subject as well. The links he
provides lead to some really interesting and well-though-out pieces, well
worth reading.

I'm going to try restating things in my own way in hopes of getting some
good feedback and a basic question:

*What are the best ways to index low cardinality values in Postgres?*

For an example, imagine an address table with 100M US street addresses with
two character state abbreviations. So, say there are around 60 values in
there (the USPS is the mail system for a variety of US territories,
possessions and friends in the Pacific.) Okay, so what's the best index
type for state abbreviation? For the sake of argument, assume a normal
distribution so something like FM (Federated States of Micronesia) is on a
tail end and CA or NY are a whole lot more common.

A *B-tree* is obviously a pretty *bad match* for this sort of situation. It
works, but B-trees are ideal for *unique* values, and super large for
repeated values. Not getting into the details or Postgres specifics of
various kinds of traditional B-trees. (I think B*?) Doesn't matter. You
have a huge index because the index size is closely related to the number
of *rows*, not the number of *distinct values*.

Alternatively, you could set up *partial indexes* for the distinct values,
like so:

Running 10 Million PostgreSQL Indexes In Production (And Counting)
https://heap.io/blog/engineering/running-10-million-postgresql-indexes-in-production

Like Jeremy, I've wondered about *GIN indexes* for low-cardinality columns.
Has anyone tried this out in PG 10 or 11? It sounds like a good idea. As I
understand it, GIN indexes are something like a B-tree of unique values
that link to another data structure, like a tree, bitmap, etc. So, in my
imaginary example, there are 60 nodes for the state codes [internally there
would be more for padding free nodes, evenly sized pages, etc....just
pretend there are 60] and then linked to that, 60 data structures with the
actual row references. Somehow.

It can imagine things going quite well with a GIN or btree_gin. I can also
imagine that the secondary data structure could get bloaty, slow, and
horrible. I've worked with a system which uses bitmaps as the secondary
structure (at least some of the time), and it can work quite well...not
sure how it's implemented in Postgres.

So, does anyone have any links or results about efficiently (space and/or
time) indexing Boolean and other low-cardinality columns in Postgres? I'm
on PG 11, but figure many are on 9.x or 10.x.

References and results much appreciated.

Thanks!

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Tom K 2019-06-01 02:53:35 Re: psql: FATAL: the database system is starting up
Previous Message Tom Lane 2019-05-31 21:39:44 Re: compiling PL/pgSQL plugin with C++