Quick Links

Re: B-tree parent pointer and checkpoints

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Teodor Sigaev <teodor(at)sigaev(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject:	Re: B-tree parent pointer and checkpoints
Date:	2010-11-12 19:20:55
Message-ID:	4CDD9397.8040600@enterprisedb.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 11.11.2010 20:34, Tom Lane wrote:
> Heikki Linnakangas<heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> Hmm, we don't currently keep track of that when we descend the tree to
>> choose the target page, but perhaps an extra Consistent call to check
>> that would be acceptable. We already call Penalty for every tuple on
>> each internal node on the way, so compared to that one more call should
>> not be too expensive. If we do that, I think it would simplify the
>> algorithm quite a bit to just update all the parents on the way down,
>> instead of traversing up from the bottom after inserting the tuple to
>> the leaf.
>
> Oh, that's a really good idea, I think. But what about page splits?
> I guess in the case of a split, you'd be replacing the parent entry
> anyway, so having previously updated it to something larger doesn't
> really cause a problem other than wasting a few cycles --- which are
> probably still less than you save by not having to traverse back up.

I started looking at this, and run into a problem with page splits. The
right-links in GiST work differently from b-tree, a right-link is only
followed if we detect a concurrent page split. A concurrent split is
detected by comparing the "NSN" field on the child page against the LSN
that we saw on the parent when walking down. It means that if you just
leave the incompletely split page in the tree, where one half is missing
the parent pointer, scans will not find any tuples on that page. They
would at first, but as soon as the the parent page is updated due to
some unrelated insertion, the LSN of the parent is bumped above the NSN
stored on the child, and the page becomes invisible to scanners.

We avoid that problem during normal operation by keeping the parent page
locked while the child is split, until the downlink is inserted into the
parent. That blocks any other modifications to the parent page that
would bump the LSN, until our downlink has been inserted. That doesn't
work after crash recovery, as all the locks are released.

I think we can work around that with a small modification to the page
split algorithm. In a nutshell, when the child page is split, put a flag
on the left half indicating that the rightlink must always be followed,
regardless of the NSN. When the downlink is inserted to the parent,
clear the flag. Setting and clearing of these flags need to be performed
during WAL replay as well.

So to split a page:

(0. Lock the page to be split)
1. Split the page. Mark the rightlink in the left half with a flag
indicating that it always needs to be followed.
2. Lock the parent.
3. Insert downlink. (The parent may need to be split too)
4. Clear the flag in the child, and update NSN to the LSN of the
downlink insertion record.
5. Release child.

6. If the parent was split in step 3, goto 2.

If we crash between steps 1 and 3, the rightlink will have the flag, so
scans will know to always follow it. If we crash after step 3, recovery
will replay steps 3 and 4, so scans will see the downlinks as usual.

After a crash, the tree can be fixed up later by vacuum or subsequent
inserts, by performing steps 2-4.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

In response to

Re: B-tree parent pointer and checkpoints at 2010-11-11 18:34:46 from Tom Lane

Responses

Re: B-tree parent pointer and checkpoints at 2010-11-12 22:34:51 from Greg Stark

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Dimitri Fontaine	2010-11-12 20:13:59	Re: wCTE behaviour
Previous Message	Tom Lane	2010-11-12 19:20:22	Re: WIP: extensible enums