Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, Kevin Grittner <kgrittn(at)gmail(dot)com>
Subject: Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Date: 2018-09-24 17:26:36
Message-ID: CAH2-WzkrxpuS8tmGap269eph8K2V+DUioK=7U=NuMJUp7Qx2kg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Sep 19, 2018 at 11:23 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> 3 modes
> -------
>
> My new approach is to teach _bt_findsplitloc() 3 distinct modes of
> operation: Regular/default mode, many duplicates mode, and single
> value mode.

I think that I'll have to add a fourth mode, since I came up with
another strategy that is really effective, though totally
complementary to the other 3 -- "multiple insertion point" mode.
Credit goes to Kevin Grittner for pointing out this technique about 2
years ago [1]. The general idea is to pick a split point just after
the insertion point of the new item (the incoming tuple that prompted
a page split) when it looks like there are localized monotonically
increasing ranges. This is like a rightmost 90:10 page split, except
the insertion point is not at the rightmost page on the level -- it's
rightmost within some local grouping of values.
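
To make that concrete, here is a minimal sketch of the split point
choice (the function and variable names are mine, for illustration
only -- this is not the actual _bt_findsplitloc() code):

#include <stdbool.h>

/*
 * Illustrative sketch only.  newitemoff is the page offset at which
 * the incoming tuple would be inserted; maxoff is the page's current
 * highest item offset.  local_ascending is the result of some
 * heuristic that detects a localized ascending insert pattern.
 */
static int
choose_split_offset(bool local_ascending, int newitemoff, int maxoff)
{
    if (local_ascending)
    {
        /*
         * Split just after the insertion point of the new item, as
         * in a rightmost 90:10 split -- except that "rightmost" here
         * means rightmost within the local grouping of values.
         */
        return newitemoff;
    }

    /* Otherwise, stick with something close to a generic 50:50 split */
    return maxoff / 2;
}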

This makes the two largest TPC-C indexes *much* smaller. Previously,
they were shrunk by a little over 5% by using the new generic
strategy, a win that now seems like small potatoes. With this new
mode, TPC-C's order_line primary key, which is the largest index of
all, is ~45% smaller following a standard initial bulk load at
scalefactor 50. It shrinks from 99,085 blocks (774.10 MiB) to 55,020
blocks (429.84 MiB). It's actually slightly smaller than it would be
after a fresh REINDEX with the new strategy. We see almost as big a
win with the second largest TPC-C index, the stock table's primary key
-- it's ~40% smaller.

Here is the definition of the biggest index, the order_line primary key:

pg(at)tpcc[3666]=# \d order_line_pkey
     Index "public.order_line_pkey"
  Column   │  Type   │ Key? │ Definition
───────────┼─────────┼──────┼────────────
 ol_w_id   │ integer │ yes  │ ol_w_id
 ol_d_id   │ integer │ yes  │ ol_d_id
 ol_o_id   │ integer │ yes  │ ol_o_id
 ol_number │ integer │ yes  │ ol_number
primary key, btree, for table "public.order_line"

The new strategy/mode works very well because we see monotonically
increasing inserts on ol_number (an order's item number), but those
are grouped by order. It's kind of an adversarial case for our
existing implementation, and yet it seems like it's probably a fairly
common scenario in the real world.
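
Schematically, the keys arrive in an order like this (hypothetical
pseudo-C, just to show the pattern; insert_key() and nlines() stand
in for a real index insertion and the number of lines per order):

/*
 * ol_number ascends within each order, then the key "jumps" back to
 * the start of the next order's range.  Each order's group of keys
 * is a localized ascending run.
 */
for (int ol_o_id = 1; ol_o_id <= norders; ol_o_id++)
{
    for (int ol_number = 1; ol_number <= nlines(ol_o_id); ol_number++)
        insert_key(ol_w_id, ol_d_id, ol_o_id, ol_number);
}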

Obviously these are very significant improvements. They really exceed
my initial expectations for the patch. TPC-C is generally considered
to be by far the most influential database benchmark of all time, and
it's a benchmark that we need to pay more attention to. My sense is
that the TPC-C benchmark is deliberately designed to almost require
that the system under test have this "multiple insertion point" B-Tree
optimization, suffix truncation, etc. This is exactly the same index
that we've seen reports of out-of-control bloat on when people run
TPC-C over hours or days [2].

My next task is to find heuristics to make the new page split
mode/strategy kick in when it's likely to help, but not kick in when
it isn't (when we want something close to a generic 50:50 page split).
These heuristics should look similar to what I've already done to get
cases with lots of duplicates to behave sensibly. Anyone have any
ideas on how to do this? I might end up inferring a "multiple
insertion point" case from the fact that the index has multiple
pass-by-value attributes, with the new/incoming tuple differing from
the tuple to its immediate left in the last column, but not in the
first few. It also occurs to me to consider the
fragmentation of the page as a guide, though I'm less sure about that.
I'll probably need to experiment with a variety of datasets before I
settle on something that looks good. Forcing the new strategy without
considering any of this actually works surprisingly well on cases
where you'd think it wouldn't, since a 50:50 page split is already
something of a guess about where future insertions will end up.
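
FWIW, the check I have in mind might look something like this
(another illustrative sketch -- attrs_equal() is a made-up helper,
not a real nbtree function):

/*
 * Infer a "multiple insertion point" case when the incoming tuple
 * matches the tuple to its immediate left on every attribute except
 * the last one.
 */
static bool
looks_like_local_ascending_inserts(IndexTuple newitem, IndexTuple leftitem,
                                   int natts)
{
    for (int attno = 1; attno < natts; attno++)
    {
        if (!attrs_equal(newitem, leftitem, attno))
            return false;       /* differs on a leading attribute */
    }

    /* All leading attributes match; the last attribute should differ */
    return !attrs_equal(newitem, leftitem, natts);
}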

[1] https://postgr.es/m/CACjxUsN5fV0kV=YirXwA0S7LqoOJuy7soPtipDhUCemhgwoVFg@mail.gmail.com
[2] https://www.commandprompt.com/blog/postgres_autovacuum_bloat_tpc-c/
--
Peter Geoghegan
