Re: kill_prior_tuple and index scan costing

From: Andres Freund <andres(at)anarazel(dot)de>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: kill_prior_tuple and index scan costing
Date: 2020-03-22 02:33:02
Message-ID: 20200322023302.ex3qxdbb5sscifix@alap3.anarazel.de

Hi,

This reply is largely based on a quick IM conversation between Peter and me.

On 2020-03-04 17:13:33 -0800, Peter Geoghegan wrote:
> Both plans are very similar, really. The number of heap accesses and
> B-Tree index page accesses is exactly the same in each case.

Note that bitmap heap scans currently have the major advantage of being
able to efficiently prefetch heap data. That can be a *huge* performance
boon (I've seen several orders of magnitude of difference on SSDs).
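
For reference, a simplified sketch of how that prefetching works (the
real logic lives in nodeBitmapHeapscan.c and is throttled by
effective_io_concurrency; the variables here are illustrative):

/*
 * The TID bitmap yields upcoming heap block numbers ahead of time, so
 * the bitmap heap scan can issue prefetch requests for blocks it will
 * read shortly.  A plain index scan only learns about the next heap
 * block one index tuple at a time.
 */
while (prefetch_pages < prefetch_target)
{
    TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);

    if (tbmpre == NULL)
        break;                  /* bitmap exhausted, nothing to prefetch */
    PrefetchBuffer(heapRel, MAIN_FORKNUM, tbmpre->blockno);
    prefetch_pages++;
}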

There are also other benefits to bitmap heap scans. For heap, the
"single tid" index->heap lookup path locks the page once for each tid,
whereas a bitmap heap scan only does so once per page - the additional
lock cycles can obviously have a noticeable performance impact. Not
having interspersed I/O between index and heap can be beneficial too.

I thought we had optimized the non-lossy bitmap path for heap
(i.e. heapam_scan_bitmap_next_block()) to perform visibility checks more
efficiently than single tid fetches
(i.e. heapam_index_fetch_tuple()). But both use
heap_hot_search_buffer(), even though the number of page locks differs.
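
To illustrate the difference in lock cycles, roughly (a sketch, not the
actual code; variable handling is simplified):

/*
 * Plain index scan path: pin and content-lock the heap buffer for each
 * TID the index returns, even when consecutive TIDs point into the
 * same heap page.
 */
buffer = ReadBuffer(heapRel, ItemPointerGetBlockNumber(tid));
LockBuffer(buffer, BUFFER_LOCK_SHARE);
found = heap_hot_search_buffer(tid, heapRel, buffer, snapshot,
                               heapTuple, &all_dead, first_call);
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);

/*
 * Bitmap heap scan path: one lock cycle per heap page, all matching
 * offsets resolved under that single lock (still via
 * heap_hot_search_buffer() in the non-lossy case).
 */
LockBuffer(buffer, BUFFER_LOCK_SHARE);
for (i = 0; i < tbmres->ntuples; i++)
    /* visibility check for tbmres->offsets[i] */ ;
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);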

I'm a bit surprised that neither heap_hot_search_buffer() nor the "lossy
path" in heapam_scan_bitmap_next_block() takes advantage of the page's
all-visible flag? I don't really see a good reason for that. The HTSV
calls do show up noticeably in profiles, in my experience.
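
What I have in mind is roughly the following (just a sketch of the
suggestion, not code that exists in either of those paths today):

/*
 * If the page is marked all-visible, every tuple on it is visible to
 * any MVCC snapshot, so the per-tuple visibility check could be
 * skipped entirely - the same trick heapgetpage() already uses for
 * seqscans.
 */
page = BufferGetPage(buffer);
all_visible = PageIsAllVisible(page) && !snapshot->takenDuringRecovery;

if (all_visible)
{
    /* tuple is visible, no HeapTupleSatisfiesVisibility() call needed */
}
else
{
    valid = HeapTupleSatisfiesVisibility(heapTuple, snapshot, buffer);
}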

While your recent btree work ensures that we get the heap tids for an
equality lookup in heap order (right?), I don't think we currently have
the planner infrastructure to know that that's the case (since other
index types don't guarantee that), or to take it into account for
planning?

> But the index scan plan has one non-obvious advantage, that might
> matter a lot in the real world: it can apply the kill_prior_tuple
> optimization. (It is never possible to use the kill_prior_tuple
> optimization during a bitmap index scan.)

Indeed. I've seen this cause very significant issues a couple
times. Basically whenever the handful of very common queries that
touched most of the data switched to bitmap heap scans, the indexes
would explode in size. Due to the index sizes involved there was no way
normal vacuum could clean up dead tuples quickly enough to prevent
bloat, but with kill_prior_tuple that wasn't a problem.

I have wondered whether we could "just" add some support for
kill_prior_tuple to the bitmap index scan infrastructure. Obviously
that'd require some way of calling "back" into the index code once
(several?) tuples on a page are found to be dead during a bitmap heap
scan. Which would require keeping track of additional metadata for each
tuple in the tid bitmap, which is obviously not free and would have to
be conditional.
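
Just to make the overhead concrete, the extra per-tuple metadata would
have to look something like this (purely hypothetical, none of these
names exist anywhere):

/*
 * Hypothetical: for each heap TID remembered in the bitmap we'd also
 * have to remember which index entry produced it, so that dead heap
 * tuples found during the bitmap heap scan can be reported back to the
 * index AM afterwards.  That roughly doubles the per-tuple footprint,
 * hence "conditional".
 */
typedef struct BitmapKillEntry
{
    ItemPointerData heap_tid;       /* heap tuple the scan looked up */
    ItemPointerData index_tid;      /* index leaf entry pointing at it */
} BitmapKillEntry;

/* ... and after processing a heap page, something like: */
if (ndead > 0)
    index_bulk_kill(indexRel, dead_entries, ndead); /* hypothetical callback */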

I don't really like the kill_prior_tuple interface much. But I don't
immediately see how to do better, without increasing the overhead.

> It makes sense that the planner determines that a bitmap index scan is
> faster -- or it used to make sense. Before commit dd299df8, which made
> heap TID a tiebreaker nbtree index column, we might find ourselves
> accessing the same heap page multiple times, pinning it a second or a
> third time within the executor (it depended on very unstable
> implementation details in the nbtree code). These days we should
> *reliably* access the same number of heap pages (and index pages) with
> either plan. (There are a couple of caveats that I'm glossing over for
> now, like pg_upgrade'd indexes.)

Leaving aside the locking difference pointed out above, there is still
a significant difference in how many times we indirectly call into the
index AM, and how much setup work has to be done, though.

There's at least one index_getnext_tid() call for each result tuple,
each of which indirectly has to call btgettuple(). And each
btgettuple() call has to do checks (do array keys need to be advanced,
has _bt_first() been called yet), whereas btgetbitmap() can amortize
that work across all tuples.
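
Schematically, the executor-side call patterns look like this
(simplified):

/*
 * Plain index scan: one indirect AM call per returned row; each
 * btgettuple() re-checks the per-call state before returning a TID.
 */
while ((tid = index_getnext_tid(indexScan, ForwardScanDirection)) != NULL)
{
    /* fetch the heap tuple for this one TID, return it, repeat */
}

/*
 * Bitmap index scan: a single AM call; btgetbitmap() fills the whole
 * bitmap in one go and can amortize the setup work.
 */
ntids = index_getbitmap(indexScan, tbm);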

And that's without considering the fact that, it seems to me,
btgetbitmap() could be significantly optimized by adding multiple
tuples to the bitmap at once, rather than doing so one by one.
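
tbm_add_tuples() already takes an array of TIDs, so as a sketch that
could look something like the following (BATCH_SIZE and the local
buffering are made up for illustration; currently each tuple results in
its own tbm_add_tuples(tbm, &tid, 1, false) call):

#define BATCH_SIZE 256

ItemPointerData batch[BATCH_SIZE];
int         nbatched = 0;

while (_bt_next(scan, ForwardScanDirection))
{
    /* collect TIDs locally, flush to the bitmap in batches */
    batch[nbatched++] = scan->xs_heaptid;
    if (nbatched == BATCH_SIZE)
    {
        tbm_add_tuples(tbm, batch, nbatched, false);
        nbatched = 0;
    }
}
if (nbatched > 0)
    tbm_add_tuples(tbm, batch, nbatched, false);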

Greetings,

Andres Freund
