From: | Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> |
---|---|
To: | Andres Freund <andres(at)2ndquadrant(dot)com> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org> |
Subject: | Re: Failure while inserting parent tuple to B-tree is not fun |
Date: | 2013-10-22 18:29:13 |
Message-ID: | 5266C3F9.4020803@vmware.com |
Lists: | pgsql-hackers |
On 22.10.2013 21:25, Andres Freund wrote:
> On 2013-10-22 19:55:09 +0300, Heikki Linnakangas wrote:
>> Splitting a B-tree page is a two-stage process: First, the page is split,
>> and then a downlink for the new right page is inserted into the parent
>> (which might recurse to split the parent page, too). What happens if
>> inserting the downlink fails for some reason? I tried that out, and it turns
>> out that it's not nice.
>>
>> I used this to cause a failure:
>>
>>> --- a/src/backend/access/nbtree/nbtinsert.c
>>> +++ b/src/backend/access/nbtree/nbtinsert.c
>>> @@ -1669,6 +1669,8 @@ _bt_insert_parent(Relation rel,
>>>              _bt_relbuf(rel, pbuf);
>>>          }
>>>
>>> +        elog(ERROR, "fail!");
>>> +
>>>          /* get high key from left page == lowest key on new right page */
>>>          ritem = (IndexTuple) PageGetItem(page,
>>>                                           PageGetItemId(page, P_HIKEY));
>>
>> postgres=# create table foo (i int4 primary key);
>> CREATE TABLE
>> postgres=# insert into foo select generate_series(1, 10000);
>> ERROR: fail!
>>
>> That's not surprising. But after I removed that elog again and restarted the
>> server, I still couldn't insert. The index is permanently broken:
>>
>> postgres=# insert into foo select generate_series(1, 10000);
>> ERROR: failed to re-find parent key in index "foo_pkey" for split pages 4/5
>>
>> In real life, you would get a failure like this e.g. if you ran out of memory
>> or disk space while inserting the downlink into the parent. Although rare in
>> practice, it's no fun if it happens.
>
> Why doesn't the incomplete split mechanism prevent this? Because we do
> not delay checkpoints on the primary and a checkpoint happened just
> before your elog(ERROR) above?
Because there's no recovery involved. The failure I injected (or an
out-of-memory or out-of-disk-space in the real world) doesn't cause a
PANIC, just an ERROR that rolls back the current transaction, nothing more.

We could put a critical section around the whole recursion that inserts
the downlinks, so that you would get a PANIC and the incomplete split
mechanism would fix it at recovery. But that would hardly be an improvement.
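
To make the ERROR-versus-PANIC distinction concrete, here is a toy C model of the two-stage split. It is only a sketch under stated assumptions: every name in it (ToyIndex, toy_split, toy_raise_error, toy_index_usable) is hypothetical and none of it is nbtree code. The crit_section_count variable loosely imitates the backend's CritSectionCount, and the "recovery" step is just one line standing in for the incomplete split mechanism.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>

/* Toy model only: these names are illustrative, not PostgreSQL's API. */
typedef enum { OUTCOME_OK = 0, OUTCOME_ERROR = 1, OUTCOME_PANIC = 2 } Outcome;

typedef struct
{
    bool page_split;        /* stage 1 done: the right sibling exists */
    bool downlink_present;  /* stage 2 done: the parent points at it */
} ToyIndex;

static int crit_section_count = 0;  /* imitates CritSectionCount */
static jmp_buf error_handler;

/* Imitates elog(ERROR): inside a critical section it becomes a PANIC. */
static void
toy_raise_error(void)
{
    longjmp(error_handler,
            crit_section_count > 0 ? OUTCOME_PANIC : OUTCOME_ERROR);
}

/*
 * Split a page in two stages, optionally failing between them.
 * With use_critical_section, stage 2 runs inside a "critical section".
 */
Outcome
toy_split(ToyIndex *idx, bool fail_before_downlink, bool use_critical_section)
{
    crit_section_count = 0;

    switch (setjmp(error_handler))
    {
        case 0:
            break;              /* normal path, continue below */
        case OUTCOME_PANIC:
            /* Restart plus recovery: the incomplete split gets finished. */
            crit_section_count = 0;
            if (idx->page_split && !idx->downlink_present)
                idx->downlink_present = true;
            return OUTCOME_PANIC;
        default:
            /* Plain ERROR: the transaction aborts, no recovery runs, */
            /* and the half-finished split stays in the index.         */
            return OUTCOME_ERROR;
    }

    idx->page_split = true;             /* stage 1: split the page */

    if (use_critical_section)
        crit_section_count++;
    if (fail_before_downlink)
        toy_raise_error();
    idx->downlink_present = true;       /* stage 2: insert the downlink */
    if (use_critical_section)
        crit_section_count--;

    assert(crit_section_count == 0);
    return OUTCOME_OK;
}

/* A split page without a downlink is what later inserts trip over. */
bool
toy_index_usable(const ToyIndex *idx)
{
    return !idx->page_split || idx->downlink_present;
}
```

In this model, toy_split(&idx, true, false) returns OUTCOME_ERROR and leaves toy_index_usable(&idx) false, which corresponds to the "failed to re-find parent key" state. The same call with use_critical_section set returns OUTCOME_PANIC and leaves the index usable, but only because the whole (imaginary) server went down and ran recovery.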
- Heikki