Re: BTree metapage lock and freelist structure

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: BTree metapage lock and freelist structure
Date: 2002-10-07 18:31:04
Message-ID: 25142.1034015464@sss.pgh.pa.us
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl> writes:
> There doesn't seem to be a way to acquire two different locks on the
> same page, so I propose to lock the InvalidBlockNumber for the btree, and
> use that as the lock to be obtained before doing extend, addfree,
> getfree or shrink. The setroot/getroot operations would still use the
> locking on BTREE_METAPAGE, so they can proceed whether the
> InvalidBlockNumber is blocked or not.

Unfortunately, blkno == InvalidBlockNumber is already taken --- it's the
encoding of a relation-level lock. Compare LockRelation and LockPage
in src/backend/storage/lmgr/lmgr.c.
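
To illustrate the collision, here's a simplified stand-in (my own invented
struct, not the real lmgr.c declarations):

typedef struct SketchLockTag
{
    Oid         relId;
    Oid         dbId;
    BlockNumber blkno;      /* block number for LockPage() */
} SketchLockTag;

/*
 * LockRelation() builds its tag with blkno = InvalidBlockNumber, while
 * LockPage() puts the caller's block number there; so a "page lock on
 * InvalidBlockNumber" would be indistinguishable from the relation lock.
 */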

It looks to me like nbtree.c is currently using LockPage(rel, 0) to
interlock relation extension --- this is the same convention used to
interlock heap extension. But access to the metapage itself is governed
by buffer locks, which are an independent facility.

I agree with your conclusion that extend, shrink, addfree, and getfree
operations may as well all use the same exclusive lock. I think it
could be LockPage(rel, 0), though.

Since addfree/getfree need to touch the metapage, at first glance it
would seem that they need to do LockPage(rel, 0) *plus* get a buffer
lock on the metapage. But I think it might work for them to get only
a shared lock on the metapage; the LockPage lock will guarantee
exclusion against other addfree/getfree operations, and they don't
really care if someone else is changing the root pointer. This is ugly
but given the limited set of operations that occur against a btree
metapage, it seems acceptable to me. Comments anyone?
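
To be concrete, getfree under that scheme might look about like the
following.  This is only a sketch --- _bt_getfreepage doesn't exist, the
actual freelist representation is waved away in a comment, and buffer
dirtying/WAL logging are omitted:

#include "postgres.h"

#include "access/nbtree.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"

static BlockNumber
_bt_getfreepage(Relation rel)
{
    Buffer      metabuf;
    BTMetaPageData *metad;
    BlockNumber blkno = InvalidBlockNumber;

    /* serializes against extend/shrink/addfree/getfree */
    LockPage(rel, 0, ExclusiveLock);

    /*
     * Shared buffer lock only: concurrent root-pointer readers are fine,
     * and the LockPage lock keeps other freelist manipulators out.
     */
    metabuf = ReadBuffer(rel, BTREE_METAPAGE);
    LockBuffer(metabuf, BT_READ);
    metad = BTPageGetMeta(BufferGetPage(metabuf));

    if (metad->btm_magic != BTREE_MAGIC)
        elog(ERROR, "_bt_getfreepage: corrupted metapage");

    /*
     * ... pop a block number out of whatever freelist representation we
     * settle on, setting blkno; this is the part that scribbles on the
     * metapage under only a shared buffer lock, which is the ugly bit ...
     */

    LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
    ReleaseBuffer(metabuf);
    UnlockPage(rel, 0, ExclusiveLock);

    return blkno;
}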

> On a different topic: the freelist structure I think should be
> represented as a freelist int32 number (btm_numfreepages) in
> BTMetaPageData, and a pointer to the first BlockNumber. Adding a new
> page is done by sequentially scanning the list until a zero is found or
> the end of the block is reached. Getting a page sequentially scans the
> same list until a blocknumber > 0 is found (iff btm_numfreepages is
> greater than zero). This allows for ~ 2000 free pages (very unlikely to
> actually happen if the shrink operation is in place).
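
For concreteness, I read that proposal as something like the following
--- the field names here are my own invention, not from any patch:

typedef struct BTFreeListSketch
{
    int32       btm_numfreepages;   /* entries currently stored */
    BlockNumber btm_freelist[1];    /* VARIABLE LENGTH: rest of metapage */
} BTFreeListSketch;

/*
 * Capacity is roughly (BLCKSZ - existing metapage data) / sizeof(BlockNumber),
 * i.e. on the order of 8192 / 4 ~= 2000 entries at the default page size,
 * which is where the ~2000 figure comes from.
 */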

No more than 2000 free pages in an index? What makes you think that's
unlikely? It's only 16 meg of free space, which would be about 1% of
a gigabyte-sized index. I think it's a bad idea to make such a
hardwired assumption.

The original concept I believe was to have a linked list of free pages,
ie, each free page contains the pointer to the next one. This allows
indefinite expansion. It does mean that you have to read in the
selected free page to get its next-free-page pointer, since that
pointer is what goes back into the metapage; so that read has to happen
while you're still holding the getfree lock. That's annoying.
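
In code terms the annoyance is roughly this (btm_freehead and
FreePageGetNext are invented names; buffer dirtying and WAL are omitted):

    Buffer      metabuf,
                freebuf;
    BTMetaPageData *metad;
    BlockNumber blkno,
                next;

    /* ... already holding LockPage(rel, 0, ExclusiveLock) ... */
    metabuf = ReadBuffer(rel, BTREE_METAPAGE);
    LockBuffer(metabuf, BT_READ);
    metad = BTPageGetMeta(BufferGetPage(metabuf));
    blkno = metad->btm_freehead;    /* head of chain (invented field) */

    /* have to read the chosen free page just to learn its successor ... */
    freebuf = ReadBuffer(rel, blkno);
    LockBuffer(freebuf, BT_WRITE);
    next = FreePageGetNext(BufferGetPage(freebuf));  /* hypothetical accessor */

    /* ... before the metapage's head pointer can be advanced and the
     * freelist lock released */
    metad->btm_freehead = next;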

Perhaps we could use a hybrid data structure: up to 2000 free pages
stored in the metapage, with any remaining ones chained together.
Most of the time, freelist operations wouldn't need to touch the
chain, hopefully.
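
In struct terms, again with invented field names:

typedef struct BTFreeListHybridSketch
{
    int32       btm_numfreepages;   /* total free pages tracked */
    int32       btm_numfreehere;    /* entries used in the array below */
    BlockNumber btm_freechain;      /* overflow chain head, or
                                     * InvalidBlockNumber if none */
    BlockNumber btm_freelist[1];    /* up to ~2000 entries right here */
} BTFreeListHybridSketch;

/*
 * getfree/addfree normally work out of btm_freelist[]; only when the
 * array runs empty (or completely fills up) does anyone have to follow
 * or extend the chain.
 */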

regards, tom lane
