From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-patches(at)postgresql(dot)org, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Page at a time index scan
Date: 2006-05-05 22:04:04
Message-ID: 28521.1146866644@sss.pgh.pa.us
Lists: pgsql-patches

Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
> On Fri, 5 May 2006, Tom Lane wrote:
>>> BTW, I just realized another bug in the patch: btbulkdelete fails to
>>> guarantee that it visits every page in the index.

> The first solution that occurs to me is to force page splits to choose the
> target page so that its blkno > the original page's blkno during vacuum.
> It would cause the index to become more fragmented more quickly, which is
> bad but perhaps tolerable.

I have a sketch of a solution that doesn't require any change in page
allocation behavior. Can anyone see any holes in this:

Assume that we have some way to recognize whether a page has been split
since the current btbulkdelete scan started. (A split would mark both the
original page and the new right-hand page as newly split.) When
btbulkdelete arrives at a page, it need take no special action unless the
page is newly split *and* its right-link points to a lower physical
address. If that's true, then after vacuuming the page, follow its
right-link and vacuum that page; repeat until arriving at a page that is
either not newly split or above the current location of the outer loop.
Then return to the outer, sequential-scan loop.

We should also have btbulkdelete clear the newly-split marker whenever
it processes a page; this ensures that no page is processed more than
once, which is probably worth the possible extra I/O needed to clear the
marker. (The cycles to re-scan a page are maybe not that important, but
if we do reprocess pages we'll end up with a bogusly high tuple count.
OTOH I'm not sure we can guarantee an accurate tuple count anyway,
so maybe it doesn't matter.)
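
To make the control flow concrete, here's a minimal self-contained sketch
of the scan order just described, with the marker-clearing folded in.  The
types and helpers (page_is_newly_split, page_right_link, vacuum_page,
index_num_blocks) are simplified stand-ins for illustration, not the real
nbtree API:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* Hypothetical helpers standing in for real page access. */
extern bool page_is_newly_split(BlockNumber blkno);
extern BlockNumber page_right_link(BlockNumber blkno);  /* rightmost: Invalid */
extern void vacuum_page(BlockNumber blkno);  /* also clears the split marker */
extern BlockNumber index_num_blocks(void);

static void
btbulkdelete_scan(void)
{
    BlockNumber nblocks = index_num_blocks();
    BlockNumber outer;

    for (outer = 0; outer < nblocks; outer++)
    {
        BlockNumber blkno = outer;

        for (;;)
        {
            bool was_newly_split = page_is_newly_split(blkno);
            BlockNumber next = page_right_link(blkno);

            vacuum_page(blkno); /* clears the newly-split marker too */

            /*
             * Chase the right-link only if this page was split since the
             * scan began and its right sibling lies behind the outer scan
             * position; anything at or ahead of it will be reached by the
             * sequential pass anyway.
             */
            if (!was_newly_split ||
                next == InvalidBlockNumber || next >= outer)
                break;

            /*
             * If another chase already cleaned the target (its marker is
             * clear), stop: revisiting would only inflate the tuple count.
             */
            if (!page_is_newly_split(next))
                break;

            blkno = next;
        }
    }
}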

AFAICS, this works correctly even if the test for a newly-split page
sometimes yields false positives; thinking a page is newly split when
it is not might cost a little extra I/O but won't lead to wrong results.
This reduces the requirements for the newly-split marking mechanism.

So, how do we do that marking? Noting that we have 16 bits we could use
in BTPageOpaqueData without making that struct any bigger (because of
alignment considerations), I'm thinking about a 16-bit counter for each
index that is incremented at the start of each btbulkdelete operation and
copied into the opaque data whenever a page is split. If the value's
different from the current counter, the page definitely hasn't been split
during the current btbulkdelete. There's a 1-in-65536 chance of a false positive,
if the last split occurred some exact multiple of 65536 vacuums ago, but
that's probably low enough to be acceptable. (We could reduce the odds of
false positives in various ways, eg by saying that zero isn't a valid
counter value and having btbulkdelete reset a page's counter to zero
anytime it has to write the page anyway.)
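
As a sketch of how that page-level test might look (the struct below is a
simplified stand-in for BTPageOpaqueData, and the field name btpo_splitid
is hypothetical):

#include <stdbool.h>
#include <stdint.h>

typedef struct BTPageOpaqueData
{
    uint32_t btpo_prev;    /* left sibling block number */
    uint32_t btpo_next;    /* right sibling block number */
    uint16_t btpo_flags;   /* page type flags */
    uint16_t btpo_splitid; /* counter value at last split, or 0 (reserved) */
} BTPageOpaqueData;

/* At split time: stamp both the original and the new right-hand page. */
static void
stamp_split_pages(BTPageOpaqueData *left, BTPageOpaqueData *right,
                  uint16_t cur_counter)
{
    left->btpo_splitid = cur_counter;
    right->btpo_splitid = cur_counter;
}

/*
 * The concrete form of page_is_newly_split from the earlier sketch.
 * A stale stamp from exactly one counter wraparound ago yields a false
 * positive, which costs a little extra I/O but never a wrong result.
 */
static bool
page_is_newly_split(const BTPageOpaqueData *opaque, uint16_t cur_counter)
{
    return opaque->btpo_splitid == cur_counter;
}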

This still has the same sort of locking issues I complained of in regard
to Heikki's idea, namely how do you make sure that everyone is using the
new counter value before you start scanning? That seems soluble, however.
We'd just need to be willing to take one more lock
during every page split operation, which does not seem an intolerable
amount of overhead. Then we do something like take a sharelock while
fetching the counter value during split (and hold the lock until the
split's complete), and take a momentary exclusive lock while advancing
the counter during btbulkdelete startup.
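
A sketch of that protocol, using a POSIX reader-writer lock purely as a
stand-in for whatever lock the server would really use (all names here are
hypothetical):

#include <pthread.h>
#include <stdint.h>

static pthread_rwlock_t split_counter_lock = PTHREAD_RWLOCK_INITIALIZER;
static uint16_t split_counter = 1;

/*
 * Page split: fetch the counter under share lock, and hold the lock until
 * the split is complete, so a concurrent btbulkdelete cannot advance the
 * counter while we're still stamping pages with the old value.
 */
static uint16_t
split_fetch_counter(void)
{
    pthread_rwlock_rdlock(&split_counter_lock);
    return split_counter;  /* stamp both halves of the split with this */
}

static void
split_done(void)
{
    pthread_rwlock_unlock(&split_counter_lock);
}

/*
 * btbulkdelete startup: a momentary exclusive lock to advance the counter.
 * Acquiring it exclusively waits out any in-progress splits, so every
 * split that completes afterward carries the new value and will be seen
 * as newly split by the scan.
 */
static uint16_t
bulkdelete_advance_counter(void)
{
    uint16_t cur;

    pthread_rwlock_wrlock(&split_counter_lock);
    split_counter = (uint16_t) (split_counter + 1);
    if (split_counter == 0)
        split_counter = 1;  /* zero reserved to mean "not newly split" */
    cur = split_counter;
    pthread_rwlock_unlock(&split_counter_lock);
    return cur;
}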

Thoughts, better ideas?

regards, tom lane
