Dynamic prefix truncation for nbtree, Lanin & Shasha design issue

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Kevin Grittner <kgrittn(at)gmail(dot)com>
Subject: Dynamic prefix truncation for nbtree, Lanin & Shasha design issue
Date: 2018-12-04 03:11:32
Message-ID: CAH2-Wzn_NAyK4pR0HRWO0StwHmxjP5qyu+X8vppt030XpqrO6w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Attached is a prototype patch for dynamic prefix truncation that
applies on top of v9 of the nbtree patch series [1] I've been working
on. This results in considerable performance improvements in some
cases, since it's (almost!) safe to skip attributes that we know must
be redundant/equal based on our descent of the B-Tree so far -- we can
reason about that without any extra comparison work. There are
problems, though, which hint at an underlying issue with the broader
nbtree design, which is what I really want to talk about. This thread
may not go anywhere, but I feel an obligation to put my thoughts on
how we do the Lanin & Shasha deletion stuff on the record, available
from the archives. I think that there may be some wisdom in the L&S
design that was overlooked.

First, I will say a bit more about the merits of the dynamic prefix
truncation patch, to establish the practical relevance of what I'm
about to go into. I'll show an improvement in bulk insertion
performance for the UK land registry data [2]. There is one entry in
the table for every property sale in the UK going back 2 or 3 decades
-- that's about 21.5 million rows. I have an index defined against a
3GB table, where most cycles are spent during insertions:

pg(at)land:5432 [25978]=# \d composite
Unlogged index "public.composite"
Column │ Type │ Key? │ Definition
──────────┼──────┼──────┼────────────
county │ text │ yes │ county
city │ text │ yes │ city
locality │ text │ yes │ locality
btree, for table "public.land2"

The total index size is 1101 MB after a CREATE INDEX. As you'd
imagine, this is a medium to low cardinality index, because there are
naturally quite a lot of individual property sales in each locality,
especially those in and around London. The query "INSERT INTO land2
SELECT * FROM land_registry_price_paid_uk" does a bulk insertion into
the unlogged table land2. Here is the impact on throughput/overall
duration:

Duration on master branch (best of 3): 01:17.165

Duration with just v9 of my patch (best of 3): 01:12.736

Duration with v9 + dynamic prefix truncation (best of 3): 01:04.038

Clearly dynamic prefix truncation adds quite a lot here -- a serial
REINDEX takes about 45 - 50 seconds on the same machine. Note that I'm
not cherry-picking my test-case; I've seen similar improvements while
bulk loading a TPC-E database in a similar manner. Clearly this is a
pretty good optimization, that complements the suffix truncation stuff
well. Also, the skip scan patch could make good use of this, since
that repeatedly descends a tree with a low cardinality leading
attribute. The optimization would be particularly helpful there.

Problem
=======

I don't think this latest experimental patch can go anywhere for the
foreseeable future, even though the idea is workable. There is an
_incredibly_ subtle race condition. I want to talk about why this is,
since it seems like it's down to a design decision, which likely has
broad implications:

"""
(Note: Lanin and Shasha prefer to make the key space move left, but their
argument for doing so hinges on not having left-links, which we have
anyway. So we simplify the algorithm by moving key space right.)
"""

(I'm quoting the nbtree README.)

IOW, nbtree page deletion is not the precise opposite of a page split,
unlike an implementation that sticks closely to what Lanin & Shasha
describe: a concurrent would-be insertion into a deleted page ends up
inserting into the deletion target's right sibling, rather than the
deletion target's left sibling. This nbtree README sentence seems
misleading to me. While the README is technically correct to say L&S
don't have left links, they do have something called "outlinks", which
seem like a special case of left link to me [3].

Comparing their "procedure move-right()" pseudo-code on page 16 to our
own _bt_moveright() routine is informative: you see that they
distinguish between an ignorable/half-dead page and a page that just
had a concurrent page split in a way that we clearly don't. We only
ever 1) stay put, or 2) move right on concurrent split or concurrent
deletion. They either 1) stay put, 2) move right when the high key is
less than the value being searched for, or 3) follow the "outlink"
when the page was marked ignorable (half-dead or dead) by a concurrent
deletion.

I think that the nbtree README means that nbtree does what you see in
L&S's "Figure 3. (b)", despite the fact that Lanin & Shasha find it
"inconvenient". L&S also say of Fig 3. (b):

"""
Let us consider various implementations of merging a node n with its
right neighbor n' in the B-link structure. If we move the data in n'
to n (Fig. 3a), there is no path that processes can follow from n' to
the data moved out. This may, for example, lead a process to conclude
that c is not in the structure.' If we move the data in n to n' (Fig.
3b), we will need to access the left neighbor of n to remove n from
the linked list. Finding the left neighbor requires either much time
or a doubly linked list, which increases complexity and the locking
overhead (see [Sh84] for example).
"""

To be fair, whether or not the outlink is a special case of a leftlink
as I suggest, or a separate thing is perhaps debatable -- I'll need to
think about backwards scans some more to develop an opinion on that.
Either way, having an extra link to handle concurrent page deletions
makes a page deletion truly the opposite of a page split. Following a
downlink becomes much closer to following a right link, in the sense
that we can reason about the next page down/to the right having no key
space overlap with the current/about-to-be-previous page.

Advantages of making our design closer to Lanin & Shasha's in this way
may include:

* More elegant design -- moving right to find something that should
actually be to the left is arguably pretty ugly.

* Being able to reliably notice concurrent page deletions that could
affect what our insertion scan key can expect to have to compare next
would give me an easy way to make dynamic prefix truncation work
correctly.

We can safely invalidate the skip attribute hint when an ignorable
(half-dead or dead) page is found, starting afresh from the page
pointed to by the half-dead page's outlink -- a page deletion that we
cannot possibly notice to the immediate left of the child causes no
problems. In particular, no concurrent insertion that inserts into a
part of the key space dynamic prefix truncation reasoned that it had
already eliminated could actually occur. The race condition can be
reliably noticed and recovered from as part of handling a concurrent
page deletion/following ignorable page's "outlink".

* Dynamic prefix truncation is not that different to actual prefix
truncation, and is therefore probably going to also block on solving
this problem.

If we ever want a configurable leaf level compression feature, we'll
probably have to confront the same issue. We'd probably end up adding
a "low key" at the leaf level when the DBA turned that feature on. How
is that supposed to work with the current design?

* Merging not-fully-empty nodes seems much easier if we wanted to go that way.

The L&S paper is very clear about the fact that it supports merging
pages that aren't yet completely empty -- they have no restrictions on
the page being totally empty. I'm pretty sure that this is the only
reason for nbtree's restriction on page deletion. I'm not actually
interested in pursuing this one myself, but I include it to aid
understanding of what I'm talking about. I remember Kevin Grittner
(CC'd) expressing interest in this project.

* We could make amcheck's bt_index_parent_check() SQL-callable
function only need an AccessShareLock, perhaps, because concurrent
VACUUM/page deletion becomes a problem it can reason about and recover
from.

The race that makes dynamic prefix truncation unworkable is exactly
the same as the race I go on about at length within amcheck's
bt_downlink_check() function -- I guess you could say that I noticed
this problem a few years ago, but I'm only now connecting the dots.
Concurrent VACUUMs could instead be dealt with there by making use of
a trick that's similar to the existing cross-page invariant trick
that's explained at length within bt_right_page_check_scankey(). The
general idea, once again, is that seeing a half-dead child page
becomes a reliable canary condition. See also: the enormously detailed
comments I wrote within bt_target_page_check() about why this is okay
to do this across siblings in the case of bt_index_check(), the
SQL-callable verification function that already only needs an
AccessShareLock.

[1] https://postgr.es/m/CAH2-Wz=apbKyaFhEfRN3UK_yXZ8DSE4Ybr0A3D87=4JWyy1QPA@mail.gmail.com
[2] https://wiki.postgresql.org/wiki/Sample_Databases
[3] https://archive.org/stream/symmetricconcurr00lani
--
Peter Geoghegan

Attachment Content-Type Size
v9-0007-POC-Add-dynamic-prefix-truncation-to-nbtree.patch application/x-patch 10.0 KB

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2018-12-04 03:26:31 Re: Index Skip Scan
Previous Message Peter Geoghegan 2018-12-04 03:10:37 Re: Making all nbtree entries unique by having heap TIDs participate in comparisons