Re: Race condition in b-tree page deletion

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Race condition in b-tree page deletion
Date: 2013-11-11 08:03:00
Message-ID: 52808F34.6020000@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 10.11.2013 01:47, Robert Haas wrote:
> I think we've tried pretty hard to avoid algorithms where the maximum
> number of lwlocks that must be held at one time is not a constant, and
> I think we're in for a bad time of it if we start to deviate from that
> principal. I'm not sure what to do about this problem, but I think
> locking N levels of the tree at once, where N can be as large as the
> tree is deep, is probably a bad plan, whether the number of locks
> required is N or 3N.

I think I found a solution that accomplishes that. It's actually not
that complicated:

Like we currently do, first climb up the tree to check that it's safe to
delete, ie. the downlink in the first non-empty parent is not the
rightmost entry. But when we reach the level where the parent is
non-empty - I'll call that the "topmost" parent - we keep that page
locked. The leaf page is kept locked while we climb.

This is enough to plug the race condition. As long as we hold a lock on
the topmost parent containing the downlink to the branch of pages we're
about to delete, it cannot become the rightmost entry. And as long as we
hold a lock on the leaf page, no new insertions can happen on any of the
internal pages in the branch, as insertions to internal pages only
happen when a child is split. However, the rest of the algorithm needs
to be slightly modified, as we cannot re-lock the lower-level pages
until we release the lock on the topmost page, to avoid deadlock.

So at this point, we hold two locks: the leaf page, and the topmost
parent containing the downlink to the branch we're deleting. Next, we
remove the downlink from the topmost parent, and mark the leaf page as
half-dead in one atomic operation. Also, a pointer to the highest
internal page in the branch we're deleting - the one the removed
downlink pointed to - is put on the leaf page. We can now release the locks.

At this point, searches and inserts work fine. The leaf page has been
marked as half-dead, so any insertions to the deleted page's keyspace
will go to its right sibling. The downlink is to the top of the branch
is gone, so even if the right sibling is split many times, any keys in
the transferred keyspace that propagate to the higher levels won't be
out-of-order.

All that is left is to unlink the all the lingering pages in the branch
we're deleting from their left and right siblings. This can be done at
any later time, and if we error out or crash for some reason, next
vacuum that comes along can finish the job. This is done one level at a
time. Lock the leaf page, and the internal page the leaf page points to,
and the internal page's left and right siblings (in the right order, not
this order). Change the left and right sibling's right- and left-links,
mark the internal page as deleted, and update the pointer in the leaf
page to point to the child of the deleted internal page. Then recurse to
the child, until we reach the leaf level.

This has the nice extra property that we don't need the
incomplete-action tracking in WAL recovery. I'd like to get rid of that
anyway.

I'm not sure what to do about stable branches. This could be
back-patched, with the caveat that this introduces new WAL record types
so standbys need to be upgraded before the master. But given the lack of
field reports in the ten years this race condition has existed, I'm not
sure it's worth the hassle.

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Erik Rijkers 2013-11-11 08:53:49 Re: Minmax indexes
Previous Message Tomas Vondra 2013-11-11 07:00:38 Re: Clang 3.3 Analyzer Results