Re: Race condition in b-tree page deletion

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Race condition in b-tree page deletion
Date: 2013-11-09 23:47:33
Message-ID: CA+TgmoZQRJyhwMc_y-2rugbOL7LpTp=3+yW2KaXvTkOJ653C1w@mail.gmail.com
Lists: pgsql-hackers

On Sat, Nov 9, 2013 at 12:40 PM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> On 09.11.2013 19:18, Heikki Linnakangas wrote:
>> On 09.11.2013 18:49, Heikki Linnakangas wrote:
>>> We could just punt if more than X pages would need to be changed. That
>>> would mean that we never delete pages at the top (h - X) levels of the
>>> tree. In practice that should be fine if X is high enough.
>>> As a data point, GIN list page deletion holds 16 pages locked at once
>>> (GIN_NDELETE_AT_ONCE). Normally, a 16-level deep B-tree is pretty huge.
>>> As another data point, in the worst case the keys are so wide that only
>>> 2 keys fit on each B-tree page. With that assumption, the tree can be at
>>> most 32 levels deep if you just insert into it with no deletions, since
>>> MaxBlockNumber ~= 2^32 (I may be off by one in either direction, not
>>> sure). Deletions make it more complicated, but I would be pretty
>>> surprised if you can construct a B-tree taller than, say, 40 levels.
>>
>> On further thought, it's worse than that. To delete a page, you need to
>> lock the left and right siblings, so you need 3 pages locked for each
>> level you delete...
>
> On further further thought, we don't need to unlink the pages immediately.
> It's enough to mark them as half-dead and remove the downlink to the topmost
> half-dead page. Marking a page as half-dead is as good as deleting it
> outright as far as searches and insertions are concerned. Actually unlinking
> the pages from the left and right siblings can be done at any later time,
> and doesn't need to be done in any particular order.
>
> So, my original musings about the number of concurrent locks needed still
> hold.
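
As a quick sanity check of that height bound, here is a back-of-the-envelope
calculation (a standalone sketch, not PostgreSQL code; the minimum fanout of
2 and the ~2^32 page limit are the assumptions from the mail above; compile
with -lm):

    #include <stdio.h>
    #include <math.h>

    int
    main(void)
    {
        /* Worst case from the discussion: only 2 keys fit on a page, so
         * the minimum fanout is 2, and the index can contain at most
         * ~2^32 pages (MaxBlockNumber). */
        double min_fanout = 2.0;
        double max_pages  = 4294967296.0;   /* ~= 2^32 */

        /* Each level multiplies the number of reachable pages by the
         * fanout, so the height is bounded by log_fanout(max_pages). */
        double max_levels = log(max_pages) / log(min_fanout);

        printf("max B-tree height: %.0f levels\n", max_levels);  /* 32 */
        return 0;
    }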

I think we've tried pretty hard to avoid algorithms where the maximum
number of lwlocks that must be held at one time is not a constant, and
I think we're in for a bad time of it if we start to deviate from that
principle. I'm not sure what to do about this problem, but I think
locking N levels of the tree at once, where N can be as large as the
tree is deep, is probably a bad plan, whether the number of locks
required is N or 3N.
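
To make the lock-cap idea concrete, here is a minimal toy sketch of punting
when a deletion would need too many concurrent locks; MAX_DELETE_LOCKS and
can_delete_branch are hypothetical names for illustration, not nbtree
internals:

    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_DELETE_LOCKS 16     /* cf. GIN_NDELETE_AT_ONCE */

    /* Return true if deleting a branch of the given height stays under the
     * cap; a real implementation would acquire the page locks here, do the
     * work, and release them. */
    static bool
    can_delete_branch(int levels_to_delete, int locks_per_level)
    {
        return levels_to_delete * locks_per_level <= MAX_DELETE_LOCKS;
    }

    int
    main(void)
    {
        /* One lock per level if siblings are unlinked lazily (half-dead
         * marking); three per level if both siblings must be locked too. */
        printf("5 levels, lazy unlink:  %s\n",
               can_delete_branch(5, 1) ? "ok" : "punt");
        printf("8 levels, eager unlink: %s\n",
               can_delete_branch(8, 3) ? "ok" : "punt");
        return 0;
    }

Under a scheme like this the cap is a compile-time constant, which keeps the
maximum number of lwlocks held at once bounded, at the cost of never deleting
pages in the top levels of a sufficiently tall tree.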

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
