Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>
Subject: Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Date: 2018-09-12 18:11:18
Message-ID: CAH2-WzmmoLNQOj9mAD78iQHfWLJDszHEDrAzGTUMG3mVh5xWPw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

Attached is v4. I have two goals in mind for this revision, goals that
are of great significance to the project as a whole:

* Making better choices around leaf page split points, in order to
maximize suffix truncation and thereby maximize fan-out. This is
important when there are mostly-distinct index tuples on each leaf
page (i.e. most of the time). Maximizing the effectiveness of suffix
truncation needs to be weighed against the existing/main
consideration: evenly distributing space among each half of a page
split. This is tricky.

* Not regressing the logic that lets us pack leaf pages full when
there are a great many logical duplicates. That is, I still want to
get the behavior I described on the '"Write amplification" is made
worse by "getting tired" while inserting into nbtree secondary
indexes' thread [1]. This is not something that happens as a
consequence of thinking about suffix truncation specifically, and
seems like a fairly distinct thing to me. It's actually a bit similar
to the rightmost 90/10 page split case.

v4 adds significant new logic to make us do better on the first goal,
without hurting the second goal. It's easy to regress one while
focussing on the other, so I've leaned on a custom test suite
throughout development. Previous versions mostly got the first goal
wrong, but got the second goal right. For the time being, I'm
focussing on index size, on the assumption that I'll be able to
demonstrate a nice improvement in throughput or latency later. I can
get the main TPC-C order_line pkey about 7% smaller after an initial
bulk load with the new logic added to get the first goal (note that
the benefits with a fresh CREATE INDEX are close to zero). The index
is significantly smaller, even though the internal page index tuples
can themselves never be any smaller due to alignment -- this is all
about not restricting what can go on each leaf page by too much. 7% is
not as dramatic as the "get tired" case, which saw something like a
50% decrease in bloat for one pathological case, but it's still
clearly well worth having. The order_line primary key is the largest
TPC-C index, and I'm merely doing a standard bulk load to get this 7%
shrinkage. The TPC-C order_line primary key happens to be kind of
adversarial or pathological to B-Tree space management in general, but
it's still fairly realistic.

For the first goal, page splits now weigh what I've called the
"distance" between tuples, with a view to getting the most
discriminating split point -- the leaf split point that maximizes the
effectiveness of suffix truncation, within a range of acceptable split
points (acceptable from the point of view of not implying a lopsided
page split). This is based on probing IndexTuple contents naively when
deciding on a split point, without regard for the underlying
opclass/types. We mostly just use char integer comparisons to probe,
on the assumption that that's a good enough proxy for using real
insertion scankey comparisons (only actual truncation goes to those
lengths, since that's a strict matter of correctness). This distance
business might be considered a bit iffy by some, so I want to get
early feedback. This new "distance" code clearly needs more work, but
I felt that I'd gone too long without posting a new version.

For the second goal, I've added a new macro that can be enabled for
debugging purposes. This has the implementation sort heap TIDs in ASC
order, rather than DESC order. This nicely demonstrates how my two
goals for v4 are fairly independent; uncommenting "#define
BTREE_ASC_HEAP_TID" will cause a huge regression with cases where many
duplicates are inserted, but won't regress things like the TPC-C
indexes. (Note that BTREE_ASC_HEAP_TID will break the regression
tests, though in a benign way can safely be ignored.)

Open items:

* Do more traditional benchmarking.

* Add pg_upgrade support.

* Simplify _bt_findsplitloc() logic.

[1] https://postgr.es/m/CAH2-Wzmf0fvVhU+SSZpGW4Qe9t--j_DmXdX3it5JcdB8FF2EsA@mail.gmail.com
--
Peter Geoghegan

Attachment Content-Type Size
v4-0001-Make-all-nbtree-index-tuples-have-unique-keys.patch application/octet-stream 130.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2018-09-12 18:14:15 Re: Performance improvements for src/port/snprintf.c
Previous Message Tom Lane 2018-09-12 17:40:01 Re: Allowing printf("%m") only where it actually works