Re: Why B-Tree suffix truncation matters

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why B-Tree suffix truncation matters
Date: 2018-07-06 00:26:11
Message-ID: CAH2-WznSin8B_aWSZHyf3=NAX_6bReGpGw++EGkvv9obRP+7Ww@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jul 5, 2018 at 7:17 AM, Andrey Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
> Some years ago I've observed viable performance improvement (around 8% in inserts and 5% in selects) of covering indexes in a series of experiments [0].

Covering indexes are definitely a closely related idea. It kind of
makes sense that we got covering indexes first, even though I think
most other DB systems had suffix truncation quite early on.

Suffix truncation (and being smarter about the point at which pages
are split, both in general, and to help suffix truncation) is
important for bloat. Index bloat and LP_DEAD killing and heap bloat
and heap pruning seem to be related, and to benefit from better
locality of access that the TID key thing adds. The same patch that
helped a lot with the composite index seems to help an awful lot with
pgbench workloads that are I/O bound. and take hours to run -- I saw
30%+ increases in transaction throughput on a benchmark that just
finished up, that was running for almost 24 hours. My latest work
seems very promising for OLTP workloads that we've traditionally not
done so great with.

I hope to be able to post a new version of my patch in the next few
days, or maybe next week. I just need to do a bit more validation, and
clean-up. V3 will be very different to V2 and V1, since it won't just
be an idea for a new architectural direction. It addresses several
problems all at once.

This work seems to have a kind of "inventor's paradox" quality to it
-- it's somehow easier to fix all the problems at once than it is to
fix just one, which is very counter-intuitive.

> Indeed, but prefix truncation can reduce whole index size dramatically, not by a matter of few percents.

I made the example composite index about 15% smaller after random
insertions, and I did so with code that mostly just pays close
attention to the existing rules. It saves us the cost of many
comparisons, and it has benefits for I/O and for bloat control. The
optimization has no downside that I'm aware of. That's why I'm
starting there.

Look at it like this: It's already true that pivot tuples may have
values that were from non-pivot tuples that were killed long ago,
since the job of pivot tuples is to simply separate the key space; the
fact that a particular index tuple happened to have the exact values
used for a pivot makes no difference to that index tuple. VACUUM may
have removed the index tuple a year ago, without caring about the
pivot tuple. As far as anybody can tell, it could be a fictitious
value that never existed in the first place. Why not *actually* start
with a truly fictitious value for pivot tuples when we're doing leaf
page splits? It's obvious that it's safe. We just need to be sure that
it separates values on each side of the original split, and be a bit
smarter about where on the page to take as a split point. Everything
else automatically follows.

Right now, the suffix truncation is not very aggressive, since it just
truncates whole attributes. Some day, we should be able to truncate up
to a single character in a text string. I think we need key
normalization for that, though.

> If we have foreign key from table 1 with 1M tuples to table 2 with 1K rows, index size can be reduced by the order of magnitude. And this optimization seems very straightforward: trim tuple's prefix, if it matches hikey's prefix (or some other's in case of leaf page).

Prefix compression is way more complicated than you seem to think.

With key normalization, it's fairly easy to do it for pivot tuples in
internal pages, because the normalized key representation is easy to
reason about during page splits -- we need pay very close attention to
the use of space. But at the leaf level, we have to use the original
representation (for index-only scans), and so the space management is
hard. You really need to add a "low key" on the leaf, because we need
to think about the universe of possible values that can go on the
page, not the values that happen to currently be present. Otherwise
the free space management in a nightmare.

In MySQL, prefix truncation is configurable, and is often not used.
I'd expect to have to make it configurable. That's a very different
kind of feature.

> I see code complexity as somewhat a downside. B-tree is kind of a rocket science.

I know what you mean, but FWIW I see it as a matter of developing a
good mental model. If you look at the code without a mental model, the
complexity is overwhelming. If you develop a high-level understanding
first, and try to visualize the structure, then eventually it starts
to make sense, and it seems way less complicated (though still very
complicated).

> Chances are you have some kind of roadmap of B-tree things to implement in nearest years?

Kind of. That's almost what this page is:
https://wiki.postgresql.org/wiki/Key_normalization

That page is really just about the representation of pivot tuples and
the representation of individual pages, and yet that's almost the only
area that I'd like to make changes to. I have a very high opinion of
the nbtree code. I'm not really interested in changing the high-level
rules. For example, I'm not interested in adding merging of pages that
are not yet completely empty. It's way too much complexity for way too
little gain. The same problem can actually be more effectively
addressed by suffix truncation -- why not just make it so that those
pages don't become half empty to begin with?

I could perhaps work on prefix compression. But probably not in the
next 3 years. It's way down the list for me.

--
Peter Geoghegan

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Ideriha, Takeshi 2018-07-06 01:30:18 RE: Global shared meta cache
Previous Message Michael Paquier 2018-07-06 00:21:41 Re: Copy function for logical replication slots