Re: An inverted index using roaring bitmaps

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Chinmay Kanchi <cgkanchi(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: An inverted index using roaring bitmaps
Date: 2022-06-08 01:13:13
Message-ID: CAH2-Wz=s3JAs=Ehqm7i8Q_K5vJDdn72yFNetrG55RS1GouPb7A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jun 7, 2022 at 5:00 PM Chinmay Kanchi <cgkanchi(at)gmail(dot)com> wrote:
> I personally don't think this is a great replacement for a BTree index - for one thing, it isn't really possible to use this approach beyond equality comparisons (for scalars) or "contains"-type operations for arrays (or tsvectors, jsonb, etc).

Why not? A bitmap is just a way of representing TIDs, that is often
very space efficient once compression is applied. In principle a
bitmap index can do anything that a B-Tree index can do, at least for
SELECTs.

Bitmap indexes are considered totally distinct to B-Tree indexes in
some DB systems because the concurrency control characteristics (the
use of heavyweight locks to protect the logical contents of the
database) are very different . I think that this is because the index
structure itself is so dense that the only practical approach that's
compatible with 2PL style concurrency control (or versions to MVCC
based on 2PL) is to lock a large number of TIDs at the same time. This
can lead to deadlocks with even light concurrent modifications --
which would never happen with an equivalent B-Tree index. But the data
structure is nevertheless more similar than different.

I probably wouldn't want to have a technique like roaring bitmap
compression of TIDs get applied by default within Postgres B-Trees,
but the reasons for that are pretty subtle. I might still advocate
*optional* TID list compression in Postgres B-Trees, which might even
be something we'd end up calling a bitmap index, that would only be
recommended for use in data warehousing scenarios. Extreme TID list
compression isn't free -- it really isn't desirable when there are
many concurrent modifications to relatively few index pages, as is
common in OLTP applications. That's one important reason why bitmap
indexes are generally only used in data warehousing environments,
where the downside doesn't really matter, but the upside pays for
itself (usually a fact table will have several bitmap indexes that are
usually combined, not just one).

> We ultimately abandoned that project because of the difficulty of keeping the bitmaps in sync with changing data, which would no longer be an issue, if this was built as an index.

I think that it would in fact still be an issue if this was built as
an index. There is a reason why the concurrency characteristics of
bitmap indexes make them unsuitable for OLTP apps, which seems
related. That wouldn't mean that it wouldn't still be worth it, but it
would definitely be a real downside with some workloads. B-Tree
deduplication is designed to have very little overhead with mixed
reads and writes, so it's a performance all-rounder that can still be
beaten by specialized techniques that come with their own downsides.

--
Peter Geoghegan

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2022-06-08 01:55:29 Re: [v15 beta] pg_upgrade failed if earlier executed with -c switch
Previous Message Robert Haas 2022-06-08 00:45:15 Re: [PATCH] Expose port->authn_id to extensions and triggers