Re: B-tree cache prefetches

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: B-tree cache prefetches
Date: 2019-04-10 22:14:29
Message-ID: CA+hUKG+LPbHaN9wdQTZATD4cS5-oDeikio6gG5=YEKCGhxT4zw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Nov 27, 2018 at 5:43 AM Andrey Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
> > 31 авг. 2018 г., в 2:40, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> написал(а):
> > [1] https://arxiv.org/pdf/1509.05053.pdf
>
> I've implemented all of the strategies used in that paper.
> On a B-tree page we have line pointers ordered in key order and tuples residing at the end of the page. Tuples at the end are mostly "appended", i.e. they are usually accommodated at the end of the free space.
> The idea of the attached patch is to try ordering the tuples according to different strategies. This reordering happens during VACUUM and can easily be broken by a single insert.
> Strategies are switched by #define:
> 1. USE_SEQ_ORDER - just order tuples in the order of the line pointers. This strategy has no particular idea behind it and is expected to work like the baseline (no changes to the code at all)
> 2. USE_EYZINGER_ORDER - depth-first topological order: if you always choose the lower branch, your search is just a sequential read of bytes.
> 3. USE_VEB_ORDER - van Emde Boas layout - try to pack 3 tuples (the mid and its two sub-mids) together. The key idea is that you take a cache miss when reading the mid, but the tuples for both possible next steps have already been fetched by that same miss.
> 4. USE_BT_ORDER - place the two sub-mids together so that both possible next mids can be fetched with a single cache prefetch.
>
> Cache prefetches in the B-tree binsearch are controlled by USE_PREFETCH.
>
> I've tested all these strategies on my laptop (2.2GHz Intel i7, 16GB RAM, MacOS 10.14.1)
> ./pgbench -i -s 25 postgres && ./pgbench -T 100 postgres
> No changes in configs.
>
> Here are results in tps:
> without prefetch
> baseline 1448.331199
> seq 1433.737204
> bt 1463.701527
> veb 1457.586089
> eyz 1483.654765
>
> with prefetch
> baseline 1486.585895
> seq 1471.757042
> bt 1480.169967
> veb 1464.834678
> eyz 1460.323392
>
> These numbers are not statistically sound - only one run was done for each configuration - and the experiment needs tuning anyway.
> From this I can conclude:
> 1. It is hard to observe a significant performance effect with pgbench. Maybe I should use some more representative B-tree performance tests?
> 2. Cache prefetches seem to increase performance even without any layout strategy. But the difference is a hair-splitting 2.6%.
> 3. For the implemented strategies my preliminary results correspond to the results in the paper: the Eyzinger layout is the best.
> 4. Among the layout strategies with prefetch, the bt-order strategy seems to be the winner.
>
> How can I improve the experimental evaluation of these layout strategies?
> Other thoughts? I'd be happy to hear any comments on this.
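
To make the layout ideas concrete for anyone who hasn't read the
paper, here is a minimal stand-alone sketch -- not taken from the
patch, all identifiers made up -- of a lower-bound search over an
array of keys stored in Eytzinger (BFS) order.  Because the two
children of element i sit next to each other at 2*i and 2*i + 1, a
single prefetch warms the cache line for whichever branch the
comparison takes, which is the same trick USE_BT_ORDER aims for, just
on a plain key array rather than B-tree line pointers and tuples (the
prefetch and bit-scan builtins are GCC/Clang-specific):

#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative sketch only: keys[1..n] hold the data in Eytzinger (BFS)
 * order, i.e. the children of keys[i] are keys[2*i] and keys[2*i + 1];
 * keys[0] is unused.  Returns the index of the smallest key >= target,
 * or 0 if there is none.
 */
static size_t
eytzinger_lower_bound(const uint32_t *keys, size_t n, uint32_t target)
{
    size_t  i = 1;

    while (i <= n)
    {
        /*
         * Both children are adjacent, so one prefetch warms the cache
         * line for whichever branch the comparison below takes.
         * (Assumes the array is padded so touching just past the end
         * is harmless.)
         */
        __builtin_prefetch(&keys[2 * i]);

        if (keys[i] < target)
            i = 2 * i + 1;      /* descend right */
        else
            i = 2 * i;          /* descend left */
    }

    /*
     * The answer is the last node from which we descended left: strip
     * the trailing "right" turns (1 bits) plus that final "left" turn
     * (0 bit).
     */
    i >>= __builtin_ctzll(~i) + 1;

    return i;
}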

What about read-only tests with prewarmed indexes? Those numbers look IO bound.

This is focusing on prefetch within btree code, but I wonder if there
is some lower hanging fruit that doesn't require any layout changes:
heap prefetch during index scans. By analogy to figure 8a
"software-pipelined prefetching" of [1], I wonder if you could build a
pipeline using dirty (unlocked) memory reads, like this:

1. Prefetch the buffer mapping hash bucket for heap_tids[n + 3]
2. Prefetch the page header for heap_tids[n + 2] using a dirty read of
the first value in the buffer mapping hash bucket
3. Prefetch the heap tuple for heap_tids[n + 1] using a dirty read of
the heap page
4. Return heap_tids[n] to the caller

If you're lucky, by the time the caller uses the tid to find the
buffer, read the line pointer and access the tuple, all three things
will be in cache and hopefully won't have changed since the 'dirty'
reads.  Prefetching the wrong cache lines would obviously be possible
if there is a buffer mapping hash table collision and the one you
wanted isn't first, or if things have just changed.  Maybe there
aren't enough cache lines for this 3-level pipeline to work,
considering that arbitrary other executor nodes could run in between,
so perhaps just a 2-level or 1-level pipeline would pay off.  As I
showed in my toy hash join prefetch experiment[2], I could get a
20-30% speedup from prefetching just the hash buckets (equivalent to
prefetching buffer mapping hash buckets), and then a further 5-10%
speedup from prefetching the first item in each hash bucket, but I
expect nested hash joins to interfere with that as they compete for
cache lines.  I suppose I could try to do three levels and fetch the
next tuple in the chain, but that seems unlikely to help because
we're getting into lower probabilities that I'd even want that data;
in contrast, the index scan -> heap scan case *definitely* needs to
go through a line pointer to a tuple somewhere on the page, so a
3-level pipeline might be worth experimenting with.  Just a thought;
I have not tried any of this.  Maybe there are really 4 levels, if
you count the buffer descriptor/lock.
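
To make the shape of that pipeline concrete, here is a purely
hypothetical sketch.  None of the prefetch_*/dirty_* helpers below
exist in PostgreSQL -- they are stand-ins for "hash the tid's block
number into the buffer mapping table and prefetch that bucket", "read
the bucket's first entry without taking the partition lock", and so
on -- so this only shows where the stages would sit relative to
handing a tid to the caller, not how they would be implemented:

#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/itemptr.h"

/* Hypothetical helpers -- these do not exist; names are made up. */
extern void prefetch_buf_mapping_bucket(ItemPointer tid);
extern Buffer dirty_buf_mapping_lookup(ItemPointer tid);
extern void prefetch_page_header(Buffer buf);
extern void prefetch_heap_tuple_via_dirty_page_read(ItemPointer tid);

typedef struct TidPrefetchState
{
    ItemPointer tids;           /* heap tids produced by the index scan */
    int         ntids;
    int         next;           /* index of the tid to hand out next */
} TidPrefetchState;

/*
 * Hand out the next tid, keeping a 3-stage prefetch pipeline running
 * ahead of the consumer.
 */
static ItemPointer
pipeline_next_tid(TidPrefetchState *state)
{
    int         n = state->next;

    /* Stage 1: prefetch the buffer mapping hash bucket for tids[n + 3]. */
    if (n + 3 < state->ntids)
        prefetch_buf_mapping_bucket(&state->tids[n + 3]);

    /*
     * Stage 2: dirty (unlocked) read of the bucket prefetched on an
     * earlier call to guess which buffer holds tids[n + 2], and prefetch
     * that page's header.  A collision or concurrent change just means
     * we warmed the wrong line.
     */
    if (n + 2 < state->ntids)
    {
        Buffer      guess = dirty_buf_mapping_lookup(&state->tids[n + 2]);

        if (guess != InvalidBuffer)
            prefetch_page_header(guess);
    }

    /*
     * Stage 3: dirty read of the page prefetched on an earlier call, to
     * prefetch the tuple that tids[n + 1]'s line pointer points to.
     */
    if (n + 1 < state->ntids)
        prefetch_heap_tuple_via_dirty_page_read(&state->tids[n + 1]);

    /* Stage 4: return tids[n], which should now be cheap to access. */
    state->next++;
    return &state->tids[n];
}

A real version would presumably carry the guessed buffers between
calls instead of re-deriving them, and dropping stage 3 (or stages 2
and 3) would give the 2-level and 1-level variants mentioned above.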

[1] http://www.cs.cmu.edu/~chensm/papers/hashjoin_tods_preliminary.pdf
[2] https://www.postgresql.org/message-id/CA%2BhUKGKB%3DEWq%2Brj4NK8CX6ZqHZzuQYWSb9iDfYKzGN6-ejShUQ%40mail.gmail.com

--
Thomas Munro
https://enterprisedb.com
