|From:||Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>|
|Subject:||btree split logic is fragile in the presence of large index items|
|Views:||Raw Message | Whole Thread | Download mbox|
Jan sent me a test case that produces a "btree: failed to add item to
the page" failure message. The cause is that a btree index page needs
to be split to make room for a new index item, but the split point is
chosen in such a way that there still isn't room for the new item on
the new page it has to go into.
This appears to be a long-standing bug, but it's a problem that's not
very likely to occur unless you are dealing with large index items
(approaching the pagesize/3 limit).
In the test case, the initial contents of the index page look like
0,7,1 1792 ("high key" from next page)
(There are actually only four "real" keys on this page; the first entry
is a copy of the lowest key from the next index page. BTW I believe
these "high keys" are the source of the extra index TOAST references
that Jan was wondering about.)
After bt_split executes, I find the left page has
0,85,1 1848 ("high key" from newly-inserted page)
and the right page has
0,7,1 1792 (still the high key from the next page)
Unfortunately, the incoming item of size 1416 has to go onto the right
page, and it still doesn't fit.
In this particular situation the choice of the bad split point seems
to be due to sloppy logic in _bt_findsplitloc --- it decides to split
the page "in the middle" but its idea of the middle is
firstindex + (lastindex - firstindex) / 2
which is off by one in this case, and is fundamentally wrong anyhow
because middle by item count is not necessarily middle by size.
This could be repaired with some work in findsplitloc, but I'm afraid
there is a more serious problem. The comments before findsplitloc say
* In order to guarantee the proper handling of searches for duplicate
* keys, the first duplicate in the chain must either be the first
* item on the page after the split, or the entire chain must be on
* one of the two pages. That is,
* [1 2 2 2 3 4 5]
* must become
*  [2 2 2 3 4 5]
* [1 2 2 2] [3 4 5]
* but not
* [1 2 2] [2 3 4 5].
* [2 2 2 2 2 3 4]
* may be split as
* [2 2 2 2] [2 3 4].
If this is accurate, then there will be cases where it is impossible to
split a page in a way that allows insertion of the new item where it
needs to go. For instance, suppose in the above example that the three
large items had been all equal keys and that the incoming item also has
that same key value. Under the rules given in this comment the only
legal split would be like [A] [B B B B] where A represents the 40-byte
key and the B's indicate the four equal keys. That will not fit.
However, I'm not sure I believe the comment anymore: it has not changed
since Postgres95 and I can see that quite a bit of work has been done
on the duplicate-key logic since then. Furthermore findsplitloc itself
sometimes ignores the claimed requirement: when it does the
split-in-the-middle case quoted above, it does not pay attention to
whether it is splitting in the middle of a group of duplicates. (But
that path is taken infrequently enough that it's possible it's just
plain broken, and we haven't noticed.)
Does anyone know whether this comment still describes the btree
equal-key logic accurately?
If so, one possible solution is to make _bt_insertonpg smart enough to
detect that there's not room to insert after making a legal split, and
then recursively split again. That looks fairly ticklish however,
especially if we want to preserve the existing guarantees of no deadlock
in concurrent insertions.
A more radical way out is to do what Vadim's been saying we should do
eventually: redo the btree logic so that there are never "equal" keys
(ie, use the item TID as a tiebreaker when ordering items). That would
fix our performance problems with many equal keys as well as simplify
the code. But it'd be a good deal of work, I fear.
Comments, opinions, better ideas?
regards, tom lane
|Next Message||Tom Lane||2000-07-18 20:55:01||Re: Untrusted PL/Tcl?|
|Previous Message||Ferruccio Zamuner||2000-07-18 20:11:36||PRIMARY KEY & INHERITANCE (fwd)|