Re: Page at a time index scan

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-patches(at)postgresql(dot)org, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Page at a time index scan
Date: 2006-05-06 20:29:29
Message-ID: Pine.OSF.4.61.0605062303550.189201@kosh.hut.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-patches

On Fri, 5 May 2006, Tom Lane wrote:

> I have a sketch of a solution that doesn't require any change in page
> allocation behavior. Can anyone see any holes in this:

Looks good to me.

> Assume that we have some way to recognize whether a page has been split
> since the current btbulkdelete scan started. (A split would mark both the
> original page and the new right-hand page as newly split.) When
> btbulkdelete arrives at a page, it need take no special action unless the
> page is newly split *and* its right-link points to a lower physical
> address. If that's true, then after vacuuming the page, follow its
> right-link and vacuum that page; repeat until arriving at a page that is
> either not newly split or is above the current location of the outer loop.
> Then return to the outer, sequential-scan loop.

It'd be a bit more efficient to finish the sequential-scan first, and
memorize the newly-split pages' right-links as they're encountered. Then
scan those pages as a separate second pass, or earlier if we run out of
memory reserved for memorizing them.

> We should also have btbulkdelete clear the newly-split marker whenever
> it processes a page; this ensures that no page is processed more than
> once, which is probably worth the possible extra I/O needed to clear the
> marker. (The cycles to re-scan a page are maybe not that important, but
> if we do reprocess pages we'll end up with a bogusly high tuple count.
> OTOH I'm not sure we can guarantee an accurate tuple count anyway,
> so maybe it doesn't matter.)

Yeah, seems worth it to always clear the marker. Definitely when the page
is dirtied anyway.

> AFAICS, this works correctly even if the test for a newly-split page
> sometimes yields false positives; thinking a page is newly split when
> it is not might cost a little extra I/O but won't lead to wrong results.
> This reduces the requirements for the newly-split marking mechanism.
>
> So, how do we do that marking? Noting that we have 16 bits we could use
> in BTPageOpaqueData without making that struct any bigger (because of
> alignment considerations), I'm thinking about a 16-bit counter for each
> index that is incremented at the start of each btbulkdelete operation and
> copied into the opaque data whenever a page is split. If the value's
> different from the current counter, the page definitely hasn't been split
> during btbulkdelete. There's a 1-in-65536 chance of a false positive,
> if the last split occurred some exact multiple of 65536 vacuums ago, but
> that's probably low enough to be acceptable. (We could reduce the odds of
> false positives in various ways, eg by saying that zero isn't a valid
> counter value and having btbulkdelete reset a page's counter to zero
> anytime it has to write the page anyway.)

If btbulkdelete always clears the marker (assuming zero isn't a valid
value), 16 bits is plenty. Unless a vacuum is aborted, there should never
be a value older than current value - 1 in the index. We could live with a
2-bit counter.

> This still has the same sort of locking issues I complained of in
> regards to Heikki's idea, namely how do you make sure that everyone is
> using the new counter value before you start scanning? That seems
> soluble however. We'd just need to be willing to take one more lock
> during every page split operation, which does not seem an intolerable
> amount of overhead. Then we do something like take a sharelock while
> fetching the counter value during split (and hold the lock until the
> split's complete), and take a momentary exclusive lock while advancing
> the counter during btbulkdelete startup.

That's not too bad. Where exactly were you thinking of putting the
counter and the lock?

> Thoughts, better ideas?

Well, we could enhance my proposal a bit to make the fragmentation effect
less severe. Instead of a simple flag indicating that a vacuum is in
progress, vacuum could announce the current page it's processing in a
per-index shmem variable. A page split must read that counter, holding a
shared lock, and choose the target page so that that boundary is not
crossed so that the new page is below the boundary and old page above.
Otherwise, it's free to choose what it wants. To make good target page
choices, it needs some help from FSM.

I think I like your proposal more, though. It seems better
concurrency-wise.

- Heikki

In response to

Responses

Browse pgsql-patches by date

  From Date Subject
Next Message Tom Lane 2006-05-06 21:00:24 Re: Page at a time index scan
Previous Message Martijn van Oosterhout 2006-05-06 19:21:29 Re: [PATCH] Add support for GnuTLS