Re: An inverted index using roaring bitmaps

From: Chinmay Kanchi <cgkanchi(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: An inverted index using roaring bitmaps
Date: 2022-06-08 00:00:45
Message-ID: CAJqPDh8cfA9JGy52moQzAJF2PQrj2wGKgL2uq2nfex5AvPmidw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I personally don't think this is a great replacement for a BTree index -
for one thing, it isn't really possible to use this approach beyond
equality comparisons (for scalars) or "contains"-type operations for arrays
(or tsvectors, jsonb, etc). I see this more as "competing" with GIN, though
I think GIN solves a different use-case. The primary thought here is that
we could build lightning fast inverted indexes for the cases where these
really help.

I played with using roaring bitmaps in production to build rollup tables,
for instance - where a single bitmap per key could satisfy count() queries
and count(*) ... GROUP BY with multiple WHERE conditions way faster than
even an index-only scan could, and without the overhead of multi-column
indexes. In our particular case, there were about 2 dozen columns with
around 30-40 million rows, and we were able to run these queries in
single-digit milliseconds. We ultimately abandoned that project because of
the difficulty of keeping the bitmaps in sync with changing data, which
would no longer be an issue, if this was built as an index.

I think your point about data warehouse-style bitmap indexes hits the nail
on the head here. This would be pretty much just that, a very efficient way
to accelerate such queries.

Cheers,
Chinmay

On Tue, Jun 7, 2022 at 11:53 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:

> On Mon, Jun 6, 2022 at 10:42 PM Chinmay Kanchi <cgkanchi(at)gmail(dot)com> wrote:
> > The simulated index in this case is outrageously fast, up to ~150x on
> the GROUP BY.
>
> Couldn't you make a similar argument in favor of adding a B-Tree index
> on "country"? This probably won't be effective in practice, but the
> reasons for this have little to do with how a B-Tree index represents
> TIDs. A GIN index can compress TIDs much more effectively, but the
> same issues apply there.
>
> The main reason why it won't work with a B-Tree is that indexes in
> Postgres are not transactionally consistent structures, in general.
> Whereas your cities_rb table is transactionally consistent (or perhaps
> just simulates a transactionally consistent index). Maybe it could
> work in cases where an index-only scan could be used, which is roughly
> comparable to having a transactionally consistent index. But that
> depends on having the visibility map set most or all heap pages
> all-visible.
>
> GIN indexes don't support index-only scans, and I don't see that
> changing. So it's possible that just adding TID compression to B-Tree
> indexes would significantly speedup this kind of query, just by making
> low cardinality indexes much smaller. Though that's a hard project,
> for many subtle reasons. This really amounts to building a bitmap
> index, of the kind that are typically used for data warehousing, which
> is something that has been discussed plenty of times on this list. GIN
> indexes were really built for things like full-text search, not for
> data warehousing.
>
> B-Tree deduplication makes B-Tree indexes a lot smaller, but it tends
> to "saturate" at about 3.5x smaller (relative to the same index with
> deduplication disabled) once there are about 10 or so distinct keys
> per row (the exception is indexes that happen to have huge keys, which
> aren't very interesting). There are many B-Tree indexes (with typical
> sized keys) that are similar in size to an "equivalent" GIN index --
> the ability to compress TIDs isn't very valuable when you don't have
> that many TIDs per key anyway. It's different when you have many TIDs
> per key, of course. GIN indexes alone don't "saturate" at the same
> point -- there is often a big size difference between low cardinality
> and ultra low cardinality data. There are bound to be cases where not
> having that level of space efficiency matters, especially with B-Tree
> index-only scans that scan a significant fraction of the entire index,
> or even the entire index.
>
> --
> Peter Geoghegan
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jonathan S. Katz 2022-06-08 00:09:18 Re: How about a psql backslash command to show GUCs?
Previous Message Michael Paquier 2022-06-07 23:59:10 Re: broken regress tests on fedora 36