From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Subject: removing limitations from bitmap index scan
Date: 2023-06-28 07:29:39
Message-ID: CAFBsxsFhMdC8dsYiupad24c952DX0B8K5msTvi7s4sxvTmep4Q@mail.gmail.com
Lists: pgsql-hackers

This email has no patch yet -- it's more of a placeholder to gather some
issues into one place. During previous work on replacing vacuum's bsearch'd
array for TID storage with a radix tree, Andres suggested [1] that the hash
table in tidbitmap.c should also be replaced. This will hopefully
illuminate how to get there.

* Current limitations

Page table entries are of fixed size. At PGCon, Mark Dilger and Andres
discussed the problem that bitmap scans assume a hard-coded limit on the
offset of a TID, one that's particular to heap AM. That's not a requirement
of hash tables in general, but that's the current state of the
implementation. Using a radix tree would smooth the way to allowing
variable-length page table entries. I have briefly talked about it with
colleagues offlist as well.
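
For context, this is roughly what the entry looks like today (paraphrasing
tidbitmap.c from memory, so the exact comments may differ) -- the bitmap
array is sized at compile time from MaxHeapTuplesPerPage:

  #define MAX_TUPLES_PER_PAGE  MaxHeapTuplesPerPage
  #define PAGES_PER_CHUNK  (BLCKSZ / 32)
  #define WORDS_PER_PAGE   ((MAX_TUPLES_PER_PAGE - 1) / BITS_PER_BITMAPWORD + 1)
  #define WORDS_PER_CHUNK  ((PAGES_PER_CHUNK - 1) / BITS_PER_BITMAPWORD + 1)

  typedef struct PagetableEntry
  {
      BlockNumber blockno;        /* page number (hashtable key) */
      char        status;         /* hash entry status */
      bool        ischunk;        /* T = lossy storage, F = exact */
      bool        recheck;        /* should the tuples be rechecked? */
      bitmapword  words[Max(WORDS_PER_PAGE, WORDS_PER_CHUNK)];
  } PagetableEntry;

A table AM with a larger maximum offset number than heap's has nowhere to
put its bits in that fixed words[] array.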

The radix tree implementation that Masahiko and I have worked on does still
currently assume fixed-size values (not absolutely, but fixed at compile
time for a given use), but last month I did some refactoring that would make
variable-sized values fairly straightforward, or at least not a big
challenge. There would of course also be some extra complexity in doing TBM
union/intersection operations etc. Recent work I did also went in the
direction of storing small-enough values in the last-level pointer, saving
memory (as well as time spent accessing it). That seems important, since
trees do have some space overhead compared to arrays.
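
To illustrate the embedded-values idea (a purely hypothetical sketch, not
the actual template interface): a value small enough to fit in a
pointer-sized slot can be stored in place of the child pointer, with a tag
bit to tell the two cases apart, relying on real pointers always being
aligned:

  /* hypothetical sketch, not the real radix tree API */
  typedef uintptr_t rt_slot;

  #define RT_SLOT_EMBEDDED   ((uintptr_t) 1)   /* low bit free due to alignment */

  static inline bool
  rt_slot_is_embedded(rt_slot slot)
  {
      return (slot & RT_SLOT_EMBEDDED) != 0;
  }

  static inline rt_slot
  rt_embed_value(uint64 value)
  {
      /* caller has verified the value fits in the remaining bits */
      return ((rt_slot) value << 1) | RT_SLOT_EMBEDDED;
  }

  static inline uint64
  rt_extract_value(rt_slot slot)
  {
      return (uint64) (slot >> 1);
  }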

* Iteration/ordering

There are also now some unpleasant consequences that stem from hashed
block numbers:
- To get them ready for the executor, the entries need to be sorted by
block number, and "random" is a strenuous sorting case because of cache
misses and branch mispredicts.
- Pages get lossified (when necessary) more or less at random.

Radix trees maintain logical ordering, allowing for ordered iteration, so
that solves the sorting problem, and should help give a performance boost.
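
For reference, tbm_begin_iterate() currently has to qsort pointers to all
the hash entries with a comparator along these lines (paraphrasing from
memory) -- exactly the step that ordered iteration would make unnecessary:

  static int
  tbm_comparator(const void *left, const void *right)
  {
      BlockNumber l = (*((PagetableEntry *const *) left))->blockno;
      BlockNumber r = (*((PagetableEntry *const *) right))->blockno;

      if (l < r)
          return -1;
      else if (l > r)
          return 1;
      return 0;
  }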

One hurdle is that shared iteration must work so that each worker can have
a disjoint subset of the input. The radix tree does have shared memory
support, but not yet shared iteration since there hasn't been a concrete
use case. Also, DSA has a noticeable performance cost. A good interim
development step is to use a local-mem radix tree for the index scan, and
then move everything out to the current array for the executor, in shmem if
the heap scan will be parallel. (I have coded some steps in that direction,
not ready to share.) That keeps that part of the interface the same,
simplifying testing. It's possible this much would work even for varlen
bitmaps: the iteration array could use a "tall skinny" page table entry
format, like

{ blockno; <metadata>; wordnum; bitmapword; }

...which would save space in many cases. Long term, we will want to move to
shared memory for the radix tree, at least as a prerequisite for parallel
bitmap index scan. The concurrency scheme is likely too coarse to make that
worthwhile now, but that will hopefully change at some point.
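
Spelled out a bit more (hypothetical layout and field names), the "tall
skinny" entry above might look like:

  typedef struct TBMIterateEntry
  {
      BlockNumber blockno;    /* page number */
      uint8       flags;      /* <metadata>: e.g. lossy / recheck bits */
      uint16      wordnum;    /* which bitmapword of this page */
      bitmapword  word;       /* the bits themselves */
  } TBMIterateEntry;

A page with only a handful of offsets set would need just one such entry,
rather than a words[] array sized for the worst case.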

* Possible simplification

Some of the above adds complexity, but I see a possible simplification:
Many places in tidbitmap.c need to know if we have a single entry, to keep
from creating the hash table. That was added before simplehash.h existed. I
suspect most of the overhead now in creating the hash table is in zeroing
the backing array (correct me if I'm wrong). The radix tree wouldn't do
that, but it would create about half a dozen memory contexts, and inserting
a single entry would allocate one or two context blocks. Neither of these
are free either. If the single-entry case is still worth optimizing, it
could be pushed down inside the radix tree as a template option that
lazily creates memory contexts etc.
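
For reference, the current single-entry optimization looks roughly like
this (from memory):

  typedef enum
  {
      TBM_EMPTY,      /* no hashtable, nentries == 0 */
      TBM_ONE_PAGE,   /* entry1 contains the single entry */
      TBM_HASH        /* pagetable is valid, entry1 is not */
  } TBMStatus;

i.e. the TIDBitmap carries one inline PagetableEntry (entry1) and only
builds the hash table once a second page shows up.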

* Multiple use

Vacuum concerns started this in the first place, so it'll have to be kept
in mind as we proceed. At the very least, vacuum will need a boolean to
disallow lossifying pages, but the rest should work about the same.
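
That could be as simple as a flag at creation time, e.g. (hypothetical
signature change -- today it's tbm_create(long maxbytes, dsa_area *dsa)):

  TIDBitmap *
  tbm_create(long maxbytes, dsa_area *dsa, bool allow_lossy);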

There are some other things, like memory management and lossy entries,
still left to work out, but this is enough to give a sense of what's involved.

[1]
https://www.postgresql.org/message-id/20230216164408.bcatntzzxj3jqn3q%40awork3.anarazel.de

--
John Naylor
EDB: http://www.enterprisedb.com
