Re: Brain dump: btree collapsing

From: "Curtis Faith" <curtis(at)galtcapital(dot)com>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Brain dump: btree collapsing
Date: 2003-02-13 17:59:53
Message-ID: 001401c2d389$b6db2270$a200a8c0@curtislaptop

Tom Lane wrote:
> Got any evidence to back that up? I was relying on
>
> [Johnson89] Johnson, T. and Shasha, D. Utilization of
> B-trees with Inserts, Deletes and Modifies ACM Symp. on
> PODS, 235-246, 1989.
>
> which provides a ton of math and simulations leading up to
> the conclusion that collapsing btree pages before they are
> fully empty doesn't really gain anything meaningful in terms
> of storage utilization, in most scenarios.

Well, I don't have that specific paper handy, but I do have [JS93]
Theodore Johnson, Dennis Shasha, "B-trees with inserts and deletes: why
free-at-empty is better than merge-at-half," which appears to be their
later thinking on the same subject.

Note the following:

"A merge-at-half B-tree will always have a space utilization of at least
50%. When all operations are modify operations, or when the number of
insert operations is the same as the number of delete operations, then
the utilization will be about 60%. In contrast, a free-at-empty B-tree
has a 0% lower bound on its space utilization, and will have about 39%
utilization on a pure-modify instruction mix. However, the space
utilization of a free-at-empty B-tree remains high if there are just a
few more insert operations than delete operations. Thus, merge-at-half
usually buys little in terms of space utilization.

In Figure 6, we showed that the restructuring rate of a merge-at-half
B-tree is significantly larger than the restructuring rate of a
free-at-empty B-tree for all values of q * :1. For many concurrent
B-tree algorithms used in practice [4, 13], restructuring causes a
serialization bottleneck. Thus, one simple but important way to increase
concurrency in B-tree operations is to reduce the probability of
restructuring. Since merge-at-half buys little space utilization but is
expensive in terms of restructuring, we recommend that B-trees
(especially concurrently accessed ones) use free-at-empty."

I don't dispute their conclusions in that context and under the
circumstances they outline: a random distribution of insertion and
deletion values for the index keys.
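
To make the trade-off concrete for myself, here is a toy leaf-level
simulation sketch. To be clear, this is my own simplification, not the
model from the paper and nothing like the actual PostgreSQL btree code;
the page capacity, the 50/50 insert/delete mix, and the split/merge
rules are all illustrative assumptions.

import bisect
import random

PAGE_CAPACITY = 32  # max index entries per leaf page (illustrative)


def find_page(pages, key):
    # pages hold disjoint, ordered key ranges; linear scan for clarity
    for i in range(len(pages) - 1):
        if key < pages[i + 1][0]:
            return i
    return len(pages) - 1


def insert(pages, key, stats):
    i = find_page(pages, key)
    bisect.insort(pages[i], key)
    if len(pages[i]) > PAGE_CAPACITY:
        # overflow: split the page in half (a restructuring event)
        stats["restructures"] += 1
        page = pages[i]
        mid = len(page) // 2
        pages[i:i + 1] = [page[:mid], page[mid:]]


def delete(pages, key, policy, stats):
    i = find_page(pages, key)
    pages[i].remove(key)
    if policy == "free-at-empty":
        if not pages[i] and len(pages) > 1:
            stats["restructures"] += 1  # drop the now-empty page
            del pages[i]
    else:  # merge-at-half
        if len(pages[i]) < PAGE_CAPACITY // 2 and len(pages) > 1:
            stats["restructures"] += 1  # merge with a neighbor, resplit on overflow
            j = i + 1 if i + 1 < len(pages) else i - 1
            lo, hi = min(i, j), max(i, j)
            merged = pages[lo] + pages[hi]  # adjacent ranges, already sorted
            if len(merged) > PAGE_CAPACITY:
                mid = len(merged) // 2
                pages[lo:hi + 1] = [merged[:mid], merged[mid:]]
            else:
                pages[lo:hi + 1] = [merged]


def run(policy, initial=5000, ops=50000, seed=0):
    rng = random.Random(seed)
    live = [rng.random() for _ in range(initial)]  # keys assumed distinct
    pages, stats = [[]], {"restructures": 0}
    for key in live:
        insert(pages, key, stats)
    stats["restructures"] = 0  # ignore the initial load
    utilization = []
    for _ in range(ops):
        if rng.random() < 0.5 or not live:  # 50/50 insert/delete mix
            key = rng.random()
            live.append(key)
            insert(pages, key, stats)
        else:
            key = live.pop(rng.randrange(len(live)))
            delete(pages, key, policy, stats)
        utilization.append(len(live) / (len(pages) * PAGE_CAPACITY))
    return sum(utilization) / len(utilization), stats["restructures"]


for policy in ("free-at-empty", "merge-at-half"):
    util, restructures = run(policy)
    print("%-15s avg utilization %.1f%%  restructurings %d"
          % (policy, util * 100, restructures))

The outputs to compare are the average utilization and the count of
restructurings under each policy, which is exactly the trade-off the
quoted passage describes; it's a sanity check of intuition, nothing
authoritative.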

However, as [Jannink95] ("Implementing Deletion in B+-trees," SIGMOD
RECORD, v.24, n.1, p.33-38, 1995) points out, the random-distribution
assumption doesn't hold under other possibly common circumstances,
specifically circumstances where deletes are taking place in
significant sections of the index at a much higher rate than inserts.

Consider from [Jannink95]:

"There has been some research on the acceptability of relaxing the
constraint of minimum node size to reduce the number of so-called unsafe
tree operations, i.e., those which contain node splitting and merging
[ZH89].

The debate has culminated in analysis of a weaker form of the deletion
algorithm which we call lazy deletion, that imposes no constraints on
the number of entries left in the nodes, allowing them to empty
completely before simply removing them. According to [GR93], most
database system implementations of B+-trees have adopted this approach.
Its most effective use is when it is vital to allow concurrent access to
the tree [JS93b], and excessive splitting and merging of nodes would
restrict concurrency. [JS89] derives some analytic solutions calculating
memory utilization for B+-trees under a mix of insertions and lazy
deletions, based on previous research which considered insertions only
[BY89]. The simulations in [JS89] support its analysis to show that in
typical situations, where deletions don't outnumber insertions in the
mix of operations, the tree nodes will contain acceptable percentages
of entries.

One of the work's assumptions [JS93a] is that the keys and tree
operations are chosen uniformly from a random distribution. This
assumption is unreasonable in certain realistic situations such as one
described below. Allowing interior nodes with only a single pointer to
exist in a B+-tree creates the possibility for arbitrarily unbalanced
trees of any height, which are virtually empty, and in which access
times have degenerated from the logarithmic bound B+-trees are meant to
guarantee to a worst case unbounded access time. Since nodes are not
removed until they are completely empty, the lazy deletion algorithm
does not regulate tree height effectively."

Jannink then illustrates an example of an index built on a timestamp,
where the basic assumption of Johnson and Shasha does not hold since
the keys are not drawn from a random distribution but are monotonically
increasing. His example is an extreme one, but I believe there are many
instances where a timestamp, sequence, or some other monotonically
increasing value is used in an index and where deletes are taking place
much more frequently for the older values.

Since sequence-generated IDs are often referenced by foreign keys, a
significant number of indexes fit into this category.

Consider a web site that tracks users and deletes inactive accounts. In
many real-world scenarios the number of inactive accounts is very high
as a percentage of all accounts. Suppose 50% of the accounts are
inactive and deleted after 90 days of disuse, and 90% are inactive
after 180 days.

If the account is implemented with a sequence-based ID, any tables that
reference the inactive accounts' IDs via foreign keys would have
indexes that are sparsely populated in their older ranges but might not
have a high percentage of completely empty index pages.

The older, but still active, accounts would keep the index pages from
being completely empty in many cases.
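
A rough back-of-the-envelope sketch of that situation (the page
capacity and deletion rate below are just illustrative numbers for this
hypothetical schema, not anything measured): because the keys are
sequence-ordered, an old leaf page is reclaimed by free-at-empty only
once every one of its entries happens to be deleted, and that is
unlikely even at a 90% deletion rate.

import random

PAGE_CAPACITY = 32      # index entries per leaf page (illustrative)
NUM_ACCOUNTS = 100000   # accounts created from a sequence, oldest first
OLD_FRACTION = 0.5      # the older half of the ID range
OLD_DELETE_RATE = 0.90  # 90% of the old accounts are inactive and deleted

random.seed(1)

# With a monotonically increasing key, leaf pages fill in ID order,
# so page p covers IDs [p * PAGE_CAPACITY, (p + 1) * PAGE_CAPACITY).
num_pages = NUM_ACCOUNTS // PAGE_CAPACITY
old_pages = int(num_pages * OLD_FRACTION)

empty = sparse = live_total = 0
for _ in range(old_pages):
    # each entry on an old page survives with probability 1 - OLD_DELETE_RATE
    live = sum(random.random() > OLD_DELETE_RATE for _ in range(PAGE_CAPACITY))
    live_total += live
    if live == 0:
        empty += 1   # the only pages free-at-empty can ever reclaim
    elif live < PAGE_CAPACITY // 2:
        sparse += 1  # below half full, but never collapsed

print("old leaf pages:    %d" % old_pages)
print("completely empty:  %d (%.1f%%)" % (empty, 100.0 * empty / old_pages))
print("below half full:   %d (%.1f%%)" % (sparse, 100.0 * sparse / old_pages))
print("utilization:       %.1f%%" % (100.0 * live_total / (old_pages * PAGE_CAPACITY)))

With these assumed numbers only a few percent of the old pages ever
become completely empty (roughly 0.9^32, about 3%), even though their
utilization is around 10%, which is exactly the sparse-but-not-empty
situation where collapsing adjacent leaves would help.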

Further, Johnson and Shasha make the claim that "free-at-empty is
better" in the specific context of restructuring causing a
serialization bottleneck. I don't think this applies to the specific
case I recommended, where the parent is the same for both leaves,
especially during a VACUUM process, where presumably we can optimize
for concurrency rather than for the speed of VACUUM itself.

- Curtis
