Re: Page at a time index scan

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Page at a time index scan
Date: 2006-05-08 09:45:44
Message-ID: 1147081544.3468.180.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-patches

On Fri, 2006-05-05 at 18:04 -0400, Tom Lane wrote:
> Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
> > On Fri, 5 May 2006, Tom Lane wrote:
> >>> BTW, I just realized another bug in the patch: btbulkdelete fails to
> >>> guarantee that it visits every page in the index.
>
> > The first solution that occurs to me is to force page splits to choose the
> > target page so that it's blkno > the original page's blkno during vacuum.
> > It would cause the index to become more fragmented more quickly, which is
> > bad but perhaps tolerable.
>
> I have a sketch of a solution that doesn't require any change in page
> allocation behavior. Can anyone see any holes in this:
>
> Assume that we have some way to recognize whether a page has been split
> since the current btbulkdelete scan started. (A split would mark both the
> original page and the new right-hand page as newly split.) When
> btbulkdelete arrives at a page, it need take no special action unless the
> page is newly split *and* its right-link points to a lower physical
> address. If that's true, then after vacuuming the page, follow its
> right-link and vacuum that page; repeat until arriving at a page that is
> either not newly split or is above the current location of the outer loop.
> Then return to the outer, sequential-scan loop.
>
> We should also have btbulkdelete clear the newly-split marker whenever
> it processes a page; this ensures that no page is processed more than
> once, which is probably worth the possible extra I/O needed to clear the
> marker. (The cycles to re-scan a page are maybe not that important, but
> if we do reprocess pages we'll end up with a bogusly high tuple count.
> OTOH I'm not sure we can guarantee an accurate tuple count anyway,
> so maybe it doesn't matter.)
>
> AFAICS, this works correctly even if the test for a newly-split page
> sometimes yields false positives; thinking a page is newly split when
> it is not might cost a little extra I/O but won't lead to wrong results.
> This reduces the requirements for the newly-split marking mechanism.
>
> So, how do we do that marking? Noting that we have 16 bits we could use
> in BTPageOpaqueData without making that struct any bigger (because of
> alignment considerations), I'm thinking about a 16-bit counter for each
> index that is incremented at the start of each btbulkdelete operation and
> copied into the opaque data whenever a page is split. If the value's
> different from the current counter, the page definitely hasn't been split
> during btbulkdelete. There's a 1-in-65536 chance of a false positive,
> if the last split occurred some exact multiple of 65536 vacuums ago, but
> that's probably low enough to be acceptable. (We could reduce the odds of
> false positives in various ways, eg by saying that zero isn't a valid
> counter value and having btbulkdelete reset a page's counter to zero
> anytime it has to write the page anyway.)

I read your earlier post about needing to lock everything and spent some
time thinking about this. The issue of needing to lock everything means
that we would never be able to do a partial vacuum of an index i.e.
remove one page without a scan. I'm more concerned about making partial
vacuum work than I am about speeding up an all-block vacuum.

I'd been mulling over the locking problems and came up with something
that looks identical to your proposal *except* for the value that gets
written into the BTPageOpaqueData on the right-hand page.

My thinking was to write the blockid of the original left hand page, so
as to record the original ancestor since split. Thus if multiple splits
occurred, then the original ancestor blockid would remain on record
until VACUUM. In more detail: When we split a page if the ancestor
blockid is not set, then we set it to be the blockid of the old page
(new left hand page). If the ancestor blockid is already set then we
copy that to the new right hand page. Every split will write a value to
BTPageOpaqueData, though the values to use are already available without
extra work.

AFAICS this doesn't change the ability to scan the index in physical
order, so we still get the benefits for that action.

A *partial* vacuum of an index block can be achieved by visiting the
block to be vacuumed, then following the link directly to the pre-split
ancestor block (if there is one), then moving right again back to the
target block. btbulkdelete always clears the marker when it cleans. This
opens the door for the approach of notifying when a page is deletable
and then having a background agent remove just that page, as
patch-proposed recently by Junji TERAMOTO.

I'm *not* presuming we have an agreed solution for partial vacuuming,
but I do want to keep that door open by implementing a locking protocol
that could support it.

> This still has the same sort of locking issues I complained of in
> regards to Heikki's idea, namely how do you make sure that everyone is
> using the new counter value before you start scanning? That seems
> soluble however. We'd just need to be willing to take one more lock
> during every page split operation, which does not seem an intolerable
> amount of overhead. Then we do something like take a sharelock while
> fetching the counter value during split (and hold the lock until the
> split's complete), and take a momentary exclusive lock while advancing
> the counter during btbulkdelete startup.

I'm not very happy about an extra lock during page splitting, which adds
a performance hit even for tables that never will need regular vacuuming
(apart from occaisional wrap-around avoidance).

AFAICS if we just store the ancestor blockid we don't need to do the
business with shared memory and extra locks etc.. The size of the field
we hold for BTPageOpaqueData seems fairly unimportant, since we leave
lots of space in index pages anyway - a few extra bytes would only make
a noise level change in performance.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

In response to

Responses

Browse pgsql-patches by date

  From Date Subject
Next Message Martijn van Oosterhout 2006-05-08 14:13:45 Re: [PATCH] Magic block for modules
Previous Message Tom Lane 2006-05-08 03:33:33 Re: pgstat: delayed write of stats file