Nasty btree deletion bug

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Cc: "Ed L(dot)" <pgsql(at)bluepolka(dot)net>
Subject: Nasty btree deletion bug
Date: 2006-10-25 17:14:56
Message-ID: 2766.1161796496@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I've been analyzing Ed L's recent report of index corruption:
http://archives.postgresql.org/pgsql-general/2006-10/msg01183.php
(thanks to Ed for letting me have the actual index to study).
I think I've figured out what happened to it. nbtree/README says

: The notion of a half-dead page means that the key space relationship between
: the half-dead page's level and its parent's level may be a little out of
: whack: key space that appears to belong to the half-dead page's parent on the
: parent level may really belong to its right sibling. We can tolerate this,
: however, because insertions and deletions on upper tree levels are always
: done by reference to child page numbers, not keys. The only cost is that
: searches may sometimes descend to the half-dead page and then have to move
: right, rather than going directly to the sibling page.

but unfortunately this analysis is too simplistic. In the situation
where a half-dead page is its parent's rightmost child (which is the
only case where we expect half-deadness to persist long), that page's
former key space now belongs to its right sibling, which means that
some keys that are less than the parent's "high key" now belong to the
keyspace of pages below the parent's right sibling. This is OK as far
as search behavior goes --- but suppose that we get a continuing stream
of insertions of new keys in that key range. This will result in page
splits that cause keys in that key range to bubble up into the upper
levels. If that keeps happening long enough, eventually we will split
the parent's right sibling at a key value less than the parent's high
key, and then we will insert that key into the grandparent level just
to the right of the parent's right sibling. Now we have an index
that's actually corrupt, because we have out-of-order index keys in
the grandparent level, which can cause searches for keys in their range
to fail (a search may descend too far to the right to find the entries
it should have found).

Since only internal pages can be half-dead, this failure requires at
least a three-level index, and it requires enough deletions within a
small range for for a level-1 page to become empty (hence half dead)
followed by a large number of insertions in that same range. Ed's
index was probably more prone to this than average because he was
indexing very wide values (~500 byte text), leading to low btree fanout
and a relatively narrow value range for a level-1 page. Still I'm a
bit surprised that we've not figured it out before, because the bug is
presumably present all the way back to 7.4 when the btree deletion code
was added.

I haven't thought of a suitable fix yet --- clearly we're going to have
to change the concept of half-deadness to some extent. But I have to
leave for a dentist appointment, so I figured I'd post what I know.

regards, tom lane

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message JEAN-PIERRE PELLETIER 2006-10-25 18:05:22 Re: [JDBC] server process (PID 1188) exited with exit code
Previous Message Casey Duncan 2006-10-25 16:49:31 Re: [DOCS] Replication documentation addition