Re: B-tree parent pointer and checkpoints

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: B-tree parent pointer and checkpoints
Date: 2010-11-16 16:06:21
Message-ID: 4CE2ABFD.5090505@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 13.11.2010 00:34, Greg Stark wrote:
> On Fri, Nov 12, 2010 at 7:20 PM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> I think we can work around that with a small modification to the page split
>> algorithm. In a nutshell, when the child page is split, put a flag on the
>> left half indicating that the rightlink must always be followed, regardless
>> of the NSN. When the downlink is inserted to the parent, clear the flag.
>> Setting and clearing of these flags need to be performed during WAL replay
>> as well.
>
> Does this not cause duplicate results? Or does GIST already have to be
> prepared to deal with duplicate results?

The GiST search algorithm avoids duplicate results by remembering the
LSN on the parent page when it follows a downlink. The split currently
happens like this:

0. (the child page is locked)
1. The parent page is locked.
2. The child page is split. The original page becomes the left half, and
new buffers are allocated for the right halves.
3. The downlink is inserted on the parent page (and the original
downlink is updated to reflect only the keys that stayed on the left
page). While keeping the child pages locked, the NSN field on the
children are updated with the new LSN of the parent page.

To avoid duplicates, when a scan looks at the child page, it needs to
know if it saw the parent page before or after the downlink was
inserted. If it saw it before, the scan needs to follow the rightlink to
the right half, otherwise it will follow the downlink as usual (if it
matched). The scan checks that by comparing the LSN it saw on the parent
page with the NSN on the child page. If parent LSN < NSN, we saw the
parent before the downlink was inserted.

Now, the problem with crash recovery is that the above algorithm depends
on the split to keep the parent and child locked until the downlink is
inserted in the parent. If you crash between steps 2 and 3, the locks
are gone. If a later insert then updates the parent page, because of a
split on some unrelated child page, that will bump the LSN of the parent
above the NSN on the child. Scans will see that the parent LSN > child
NSN, and will no longer follow the rightlink.

And the fix for that is to set a flag on the child page indicating that
rightlink has to be always followed regardless of the LSN/NSN, because
the downlink hasn't been inserted yet. When the downlink is inserted,
the flag is cleared and we rely on the existing LSN/NSN mechanism to
avoid duplicate results.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2010-11-16 16:12:53 autovacuum maintenance_work_mem
Previous Message Magnus Hagander 2010-11-16 15:44:12 Re: Isn't HANDLE 64 bits on Win64?