Re: Brain dump: btree collapsing

From: "Curtis Faith" <curtis(at)galtcapital(dot)com>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "'Hannu Krosing'" <hannu(at)tm(dot)ee>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Brain dump: btree collapsing
Date: 2003-02-14 18:00:39
Message-ID: 000a01c2d452$fd3e4ac0$a200a8c0@curtislaptop
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

tom lane wrote:
> How does that help? The left-moving indexscan still has no
> way to recover. It can't go back to the page it was on
> before and try to determine which entries you added there,
> because it has no reliable reference point to do so. The
> entry it previously returned might not be there anymore, and
> in a non-unique index key comparisons won't help. And even
> more to the point, how would it know you've changed the left
> page? It has no idea what the previous "page version" on the
> left page was, because it was never there before.

Revisiting the idea I proposed in a previous email after more research,
I believe it is definitely possible to implement a mutex based scheme
that will prevent scans from being polluted by merges of index pages and
that does not result in the mutex being held for any significant
duration.

1) Current scan code does this:

bool
_bt_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
{
... definitions go here...

if (ScanDirectionIsForward(dir))
{
if (!PageIsEmpty(page) && offnum < maxoff)
offnum = OffsetNumberNext(offnum);
else
{
/* walk right to the next page with data */
for (;;)
{
/* if we're at end of scan, release the
buffer and return */
... skip code here...

/* step right one page */
blkno = opaque->btpo_next;
_bt_relbuf(rel, *bufP);
*bufP = _bt_getbuf(rel, blkno, BT_READ);

... skip rest of code...

3) Note that the calls

_bt_relbuf(rel, *bufP);
*bufP = _bt_getbuf(rel, blkno, BT_READ);

a) appear adjacent to each other
b) relbuf calls:

LockBuffer(buf, BUFFER_LOCK_UNLOCK);
ReleaseBuffer(buf);

c) getbuf only calls:

buf = ReadBuffer(rel, blkno);
LockBuffer(buf, access);

in the case of an existing buffer, rather than a new one.

4) This could easily be reordered into:

buf = ReadBuffer(rel, blkno); /* pin next page
*/
LockBuffer(buf, BUFFER_LOCK_UNLOCK); /* release lock on
current page */
LockBuffer(buf, BT_READ); /* lock next page */
ReleaseBuffer(buf); /* now release pin on
previously current page */

without affecting the logic of the code or causing any deadlock
problems since the release still occurs before the lock of the next
page.

5) A mutex/spinlock that was stored in the index could be acquired by
the scan code like this:

buf = ReadBuffer(rel, blkno); /* pin next page
*/

SpinLockAcquire( indexSpecificMutex ); /* lock the index
reorg mutex */

LockBuffer(buf, BUFFER_LOCK_UNLOCK); /* release lock on
current page */
LockBuffer(buf, BT_READ); /* lock next page */

SpinLockRelease( indexSpecificMutex ); /* unlock the index
reorg mutex */

ReleaseBuffer(buf); /* now release pin on
previously current page */

6) The same index specific mutex/spinlock could be used for the merge
code surrounding only the acquisition of the four page locks. This would
obviate any problems with scans and page merges, since the lock
acquisition for the merge could never occur while a scan was between
pages.

Further, with the reordering, the spinlock for the scan code doesn't
seem particularly onerous since it would be surrounding only two LWLock
calls. To reduce the overhead to an absolute minimum for the scan case
these could be pushed down into a new IW call (probably necessary since
the LockBuffer, ReleaseBuffer code does some error checking and such
that one wouldn't want in code guarded by a mutex.

- Curtis

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kevin Brown 2003-02-14 18:09:37 Re: location of the configuration files
Previous Message Josh Berkus 2003-02-14 17:36:02 Re: Tuning scenarios (was Changing the default configuration)