Re: B-tree parent pointer and checkpoints

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: B-tree parent pointer and checkpoints
Date: 2010-11-11 11:22:55
Message-ID: 4CDBD20F.8090807@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 11.11.2010 00:49, Tom Lane wrote:
> I wrote:
>> What happens if you error out in between? Or is it assumed that the
>> *entire* sequence is a critical section? If it has to be that way,
>> one might wonder what's the point of trying to split it into multiple
>> WAL records.
>
> Or, to be more concrete: I'm wondering if this *entire* mechanism isn't
> a bad idea that we should just rip out.
>
> The question that ought to be asked here, I think, is whether it
> shouldn't be required that every inter-WAL-record state is a valid
> consistent state that doesn't require post-crash fixups. If that
> isn't the case, then a simple ERROR or FATAL exit out of the backend
> that was creating the sequence originally will leave the system in
> an unacceptable state. We could prevent such an exit by wrapping the
> whole sequence in a critical section, but if you have to do that then
> it's not apparent why you shouldn't fold it into one WAL record.
>
> IOW, forget this patch. Take out the logic that tries to complete
> pending splits during replay, instead. I believe this is perfectly safe
> for btree: loss of a parent record isn't fatal, as proven by the fact
> that searches don't have to be locked out while a split proceeds.
> (We might want to make btree_page_del not think that a missing parent
> record is an error, but it shouldn't think that anyway, because of the
> possibility of a non-crashing failure during the original split.)
> This approach might not be safe for GIST or GIN; but if it isn't, they
> need fixes anyway.

GIN is similar to b-tree, the incomplete split logic there is for
inserting the parent pointers in the b-tree within the GIN index, just
like nbtree.

GiST is different. When you insert a key to a leaf page, you (sometimes)
need to adjust the parent pointer to reflect the new key as well. B-tree
tolerates incomplete splits with the 'next page' pointer, but that is
not applicable to gist. Teodor described the issue back in 2005 when
WAL-logging was added to GiST
(http://archives.postgresql.org/pgsql-hackers/2005-06/msg00555.php):

> The problem with incopmleted inserts is: when new entry is installed into leaf page, all chain (in a worst case) of keys from root page to leaf should be updated. Of course, case of updating all chain is rarely and usially it's updated only part. Each key on inner pages contains union of keys (bounding box in a case of rtree, for example) on page which it points. This union can formed only with a help of user-defined function of opclass, because of GiST doesn't know something about nature of keys. Returning to WAL, GiST core write xlog entry with all nessary information for restoration before write page, but in this moment it doesn't know it should update keys on parent page or key is unchanged. So GiST's WAL restoration code should remember this page's update as incompleted insert. When insert complited, GiST's core write to log that insert is completed and restore code can clean up stored incompleted insert. If it was crash, then sign of completed insert can be absent in
log, and GiST's restore code should continue it. While continue, it's know which page was changed and should walk up to root. On each step of walk it should form union for page and insert it to parent.

Reading that I wonder: what harm would an incomplete insert cause if we
just left it in the tree? Imagine that you insert a key to a leaf page,
but crash before updating the parent. If you search for the key starting
from the root, you'll fail to find it, because the parent pointer claims
that there are no entries with such a key on the child page. But that's
OK, the inserting transaction aborted with the crash!

Do some of the other GiST algorithms get confused if there's a key on a
page that's not represented by the parent pointer? It's possible that
you insert a key to the leaf, update the leaf's immediate parent, but
crash before updating the parent's parent. As far as search is
concerned, that's OK as well, but it creates a hazard for subsequent
inserts. If you later insert a tuple with the same key to the same leaf
page, the insertion will see that the parent pointer already includes
the key, and will fail to update the parent's parent. That's a problem.

Would it be hard to change the algorithm to update the parent keys
top-down rather than bottom-up?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2010-11-11 11:24:03 Re: B-tree parent pointer and checkpoints
Previous Message Peter Eisentraut 2010-11-11 06:38:20 Re: Exposing an installation's default value of unix_socket_directory