Re: btree shrinking again

From: "Curtis Faith" <curtis(at)galtair(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Alvaro Herrera" <alvherre(at)dcc(dot)uchile(dot)cl>
Cc: "Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: btree shrinking again
Date: 2002-11-18 16:16:56
Message-ID: DMEEJMCDOJAKPPFACMPMOEFACFAA.curtis@galtair.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl> writes:
> > + Deletions are handled by getting a super-exclusive lock on the target
> > page, so that no other backend has a pin on the page when the deletion
> > starts. This means no scan is pointing at the page. This is OK for
> > deleting leaf items, probably not OK for deleting internal nodes;
> > will need to think harder when it's time to support index compaction.
>
> > In what cases is not OK to delete an item from an internal node, holding
> > a super-exclusive lock?
>
> I believe the thing I was worried about when I wrote that note was the
> stack of ancestor pointers maintained by an insert operation: the insert
> will not have pins on those pages, but might try to return to them
> later (to service a page split).

tom lane wrote:
> A simple-minded solution might be to keep the pins until the insert is
> done, but you'd have to think about possible deadlock conditions as well
> as loss of concurrency. I'd prefer to find a solution that didn't
> require that.

There is an algorithm that work pretty well for this and that don't require
holding pins up the tree in most cases. The locking for deletes mirrors that
for inserts. In each case, either a delete or an insert could require a
change to the parent page recursively up to and including the root page.

Traversals pin read-only starting at the root, after each page pin is
acquired pins on previous, more interior pages are released. Leaf pages are
pinned read/write for insert or delete cases. The strict ordering of lock
acquisition pinning from the root toward the more exterior/leaf nodes, and
never from the leaf up, will prevent deadlocks.

Each page keeps a version number that is incremented after each change (you
don't really have to worry about rollover on this since checks against the
version number are for equality and inequality only).

If an insert will cause a split and hence an insert for the interior page
then the parent interior page of the page that has been split is pinned for
write without waiting (waiting could cause deadlocks since it would violate
the root to leaf acquisition ordering mentioned above). If the pin fails
because another lock is held on that page, or the page is not there; or after
pinning the version number is different than the version number found when
the page was originally pinned on the way down to the leaf, the insert is
aborted.

In case of abort, the btree is searched again starting at the root going down
to the page which needs to be inserted write pinning each of the pages on the
way down.

Another variation simply retries the insert again in the case of aborts but
acquires a write lock on page one level up from the lowest level of the last
try. In the case of the first abort for a four-level tree this would acquire
a read lock on the root node (level 0) and level 1, and write locks on levels
2 and 3, since level 3 failed the previous time.

This method provides better concurrency at the expense of slightly more work
for inserts or deletes that require major restructuring of the btrees.

Leaf pages can be pinned across the bottom of the tree for index scans but
never up the tree. This means that leaf nodes require sibling pointers in
order to support index scans. I don't know if PostgreSQL's btree has these
already.

I'm sure I have missed some detail but this is the general idea.

- Curtis

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2002-11-18 16:19:25 Re: DBD::PostgreSQL
Previous Message David Wheeler 2002-11-18 16:16:03 Re: DBD::PostgreSQL