Re: An inverted index using roaring bitmaps

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Chinmay Kanchi <cgkanchi(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: An inverted index using roaring bitmaps
Date: 2022-06-07 18:53:28
Message-ID: CAH2-Wz=PfzyWW7PYcB8JmZ0A=sN8uPf4Aspw076=j5WRsoQcMA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jun 6, 2022 at 10:42 PM Chinmay Kanchi <cgkanchi(at)gmail(dot)com> wrote:
> The simulated index in this case is outrageously fast, up to ~150x on the GROUP BY.

Couldn't you make a similar argument in favor of adding a B-Tree index
on "country"? This probably won't be effective in practice, but the
reasons for this have little to do with how a B-Tree index represents
TIDs. A GIN index can compress TIDs much more effectively, but the
same issues apply there.

The main reason why it won't work with a B-Tree is that indexes in
Postgres are not transactionally consistent structures, in general.
Whereas your cities_rb table is transactionally consistent (or perhaps
just simulates a transactionally consistent index). Maybe it could
work in cases where an index-only scan could be used, which is roughly
comparable to having a transactionally consistent index. But that
depends on having the visibility map set most or all heap pages
all-visible.

GIN indexes don't support index-only scans, and I don't see that
changing. So it's possible that just adding TID compression to B-Tree
indexes would significantly speedup this kind of query, just by making
low cardinality indexes much smaller. Though that's a hard project,
for many subtle reasons. This really amounts to building a bitmap
index, of the kind that are typically used for data warehousing, which
is something that has been discussed plenty of times on this list. GIN
indexes were really built for things like full-text search, not for
data warehousing.

B-Tree deduplication makes B-Tree indexes a lot smaller, but it tends
to "saturate" at about 3.5x smaller (relative to the same index with
deduplication disabled) once there are about 10 or so distinct keys
per row (the exception is indexes that happen to have huge keys, which
aren't very interesting). There are many B-Tree indexes (with typical
sized keys) that are similar in size to an "equivalent" GIN index --
the ability to compress TIDs isn't very valuable when you don't have
that many TIDs per key anyway. It's different when you have many TIDs
per key, of course. GIN indexes alone don't "saturate" at the same
point -- there is often a big size difference between low cardinality
and ultra low cardinality data. There are bound to be cases where not
having that level of space efficiency matters, especially with B-Tree
index-only scans that scan a significant fraction of the entire index,
or even the entire index.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kaiting Chen 2022-06-07 18:59:24 Allow foreign keys to reference a superset of unique columns
Previous Message Tomas Vondra 2022-06-07 17:47:11 Re: pgcon unconference / impact of block size on performance