Re: B-tree parent pointer and checkpoints

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: B-tree parent pointer and checkpoints
Date: 2011-09-06 10:21:28
Message-ID: 4E65F428.6030306@enterprisedb.com
Lists: pgsql-hackers

On 05.09.2011 21:55, Bruce Momjian wrote:
> Heikki Linnakangas wrote:
>> On 11.03.2011 19:41, Tom Lane wrote:
>>> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>>>> On 11.03.2011 17:59, Tom Lane wrote:
>>>>> But that will be fixed during WAL replay.
>>>
>>>> Not under the circumstances that started the original thread:
>>>
>>>> 1. Backend splits a page
>>>> 2. Checkpoint starts
>>>> 3. Checkpoint runs to completion
>>>> 4. Crash
>>>> (5. Backend never got to insert the parent pointer)
>>>
>>>> WAL replay starts at the checkpoint redo pointer, which is after the
>>>> page split record, so WAL replay won't insert the parent pointer. That's
>>>> an incredibly tight window to hit in practice, but it's possible in theory.
>>>
>>> Hmm. It's not so improbable that checkpoint would start inside that
>>> window, but that the parent insertion is still pending by the time the
>>> checkpoint finishes is pretty improbable.
>>>
>>> How about just reducing the deletion-time ERROR for missing downlink to a LOG?
>>
>> Well, the code that follows expects to have a valid parent page locked,
>> so you can't literally do just that. But yeah, LOG and aborting the page
>> deletion seems fine to me.
>
> Did this get fixed?

Nope.

On a closer look, this isn't only a problem for page deletion. Page
splitting also barfs if it can't find the parent of a page. As the code
stands, a missing downlink is not harmless, but causes all sorts of trouble.

The window for this to happen with a checkpoint is extremely tight, but
there's another situation where you can end up with a missing downlink:
if you run out of disk space while splitting the parent page to make
room for the new downlink.

I think we should apply a fix to B-tree similar to the one I made for
GiST, and put a flag on pages with missing downlinks. Then we can fix
the missing downlinks in vacuum and insertion, and get rid of the code
to fix incomplete splits after WAL replay.

The way it would work is that on page split, the right page is flagged
with a MISSING_DOWNLINK flag. When the downlink is inserted into the
parent, the flag is cleared in the same critical section in which the
WAL record for the parent insertion is written. Normally, a backend
would never see the flag set, because the locks on the split pages are
not released until the parent record is written and the flag cleared
again. But if inserting the downlink fails for any reason, the next
inserter or vacuum that steps on the page can finish the split by
inserting the downlink.
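
A minimal sketch of the flag protocol described above, written as a toy
in-memory model rather than actual nbtree code. Only the name
MISSING_DOWNLINK comes from this mail; the ToyPage struct, the flag
value, and all toy_* functions are hypothetical and exist only for
illustration. Locking, WAL logging, and critical sections are left out,
so the "insert downlink and clear flag" step is atomic here only because
the model is single-threaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag name from the mail; the bit value is an arbitrary assumption. */
#define MISSING_DOWNLINK 0x01

#define TOY_MAX_DOWNLINKS 8

/* Hypothetical stand-in for a B-tree page; not the real page layout. */
typedef struct ToyPage
{
    int             flags;
    int             nchildren;
    struct ToyPage *children[TOY_MAX_DOWNLINKS];
} ToyPage;

/* On split, the new right page starts out flagged: no parent points to it. */
static void
toy_split(ToyPage *right)
{
    right->flags |= MISSING_DOWNLINK;
}

/*
 * Insert the downlink and clear the flag in one step. In the proposal this
 * pairing would happen in the same critical section as the WAL record.
 * Returns false if the parent is full, which stands in for any failure
 * (for example, running out of disk space while splitting the parent).
 */
static bool
toy_insert_downlink(ToyPage *parent, ToyPage *right)
{
    assert(right->flags & MISSING_DOWNLINK);

    if (parent->nchildren >= TOY_MAX_DOWNLINKS)
        return false;           /* flag stays set, split stays incomplete */

    parent->children[parent->nchildren++] = right;
    right->flags &= ~MISSING_DOWNLINK;
    return true;
}

/*
 * What a later inserter or vacuum would do when it steps on a page:
 * finish the split if the flag is still set. Returns true only if it
 * actually inserted a downlink.
 */
static bool
toy_finish_split_if_needed(ToyPage *parent, ToyPage *page)
{
    if ((page->flags & MISSING_DOWNLINK) == 0)
        return false;           /* split already complete */
    return toy_insert_downlink(parent, page);
}

/* Does the parent already hold a downlink to this page? */
static bool
toy_has_downlink(const ToyPage *parent, const ToyPage *page)
{
    for (int i = 0; i < parent->nchildren; i++)
    {
        if (parent->children[i] == page)
            return true;
    }
    return false;
}
```

In this model a crash between the split and the parent insertion simply
means toy_insert_downlink() was never called: the flag is still set on
the right page, and the next call to toy_finish_split_if_needed()
completes the split without needing any post-replay cleanup.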

Unfortunately that means holding the locks on the split pages longer
than we do at the moment. Currently they are released as soon as the
parent page is locked; with this change they would need to be held until
the WAL record of the downlink insertion is done. B-tree is so heavily
used that I'm a bit hesitant to sacrifice any concurrency there, but I
don't think it would be noticeable in practice.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com