Quick Links

Re: Brain dump: btree collapsing

From:	mlw <pgsql(at)mohawksoft(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Brain dump: btree collapsing
Date:	2003-02-28 06:29:28
Message-ID:	3E5F01C8.60608@mohawksoft.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Just a few questions:

Is it vital that the index shrink? There are three conceptual phases of
a database, right? It is growing, it is shrinking, and it is staying the
same.

In the idea that it is growing, this is not an issue.
In the idea that it is staying the same, surely it must come to some
sort of equilibrium.
In the idea that the DB is shrinking, is this really an issue?

I would say that disk space utilization is far secondary to performance.

Tom Lane wrote:

>I've been thinking hard for the last few days about how to do space
>reclamation in b-tree indexes, i.e., recycle pages that are in
>no-longer-useful portions of the tree structure. We know we need this to
>solve the "index bloat" problem that occurs when the distribution of keys
>changes over time. I feel that it's critical that the reclamation be doable
>by plain VACUUM, ie, without acquiring exclusive lock on the index as a
>whole. This discussion therefore assumes that page deletion must be able
>to operate in parallel with insertions and searches.
>
>
>Issues
>------
>
>We need to get rid of parent links in btree pages; otherwise removal of a
>non-leaf page implies we must find and update all parent links that lead
>to it. This is messy enough that it would be better to do without. The
>only thing the parent link is really needed for is to find the new parent
>level after a root split, and we can handle that (very infrequent) case by
>re-descending from the new root.
>
>Instead of storing parent links, label all pages with level (counting
>levels up from zero = leaf, so that a page's level doesn't change in a
>root split). Then, if we find ourselves needing to re-descend, we can be
>sure of finding the correct parent level, one above where we were, even if
>there's been multiple root splits meanwhile. The level will also make for
>a useful cross-check on link validity: we will always know the level of
>the page we expect to arrive at when following a link, so we can check
>that the page has the right level.
>
>Unfortunately this means tossing out most of the FixBtree code Vadim wrote
>2 years ago, because it seems critically dependent on having parent links.
>But I don't really see why we need it if we rely on WAL to maintain btree
>consistency. That will require some improvements in WAL-logging for
>btrees, however. (Details below.)
>
>When a page is deleted, it can't actually be recycled until there are no
>more potentially in-flight references to it (ie, readers that intend to
>visit the page but have not yet acquired a lock on it). Those readers
>must be able to find the page, realize it's dead, and follow the correct
>sidelink from it. [Lanin&Shasha86] describe the "drain technique", which
>they define as "delay freeing the empty page until the termination of all
>processes whose locate phase began when pointers to the page still
>existed". We can conveniently implement this by reference to
>transactions: after removing all links to the page, label the now-dead
>page with the current contents of the next-transaction-ID counter. VACUUM
>can recycle the page when this is older than the oldest open transaction.
>
>Instead of an actively maintained freelist on disk as per Alvaro Herrera's
>patch, I plan to use the FSM to remember where recyclable pages are, much
>as we do for tables. The FSM space requirements would be small, since
>we'd not be needing to enter any data about partially-full pages; only
>truly empty, recyclable pages would need to be stored. (Is it worth
>having an alternate representation in the FSM for indexes, so that we only
>store page numbers and not the useless amount-free statistic?)
>
>Without a freelist on disk, VACUUM would need to scan indexes linearly to
>find dead pages, but that seems okay; I'm thinking of doing that anyway to
>look for empty pages to turn into dead ones.
>
>
>Restructuring the tree during page deletion
>-------------------------------------------
>
>We will delete only completely-empty pages. If we were to merge nearly-empty
>pages by moving data items from one page to an adjacent one, this would
>imply changing the parent's idea of the bounding key between them ---
>which is okay if we are just deleting an internal key in the parent, but
>what if the pages have different parent pages? We'd have to adjust the
>parents' own bounding key, meaning the parents' parent changes, perhaps
>all the way to the root. (Not to mention that with variable-size keys,
>there's no guarantee we can make such changes without splitting the
>upper-level pages.) And, since we support both forward and backward
>index scans, we can't move leaf items in either direction without risking
>having a concurrent scan miss them. This is way too messy, especially for
>something that has only minimal return according to the literature
>[Johnson89]. So, no merging.
>
>Deletion of an empty page only requires removal of the parent's item
>linking to it (plus fixing side pointers, which is pretty trivial). We
>also remove the next higher key in the parent, which is the parent's upper
>bound for data that would have belonged on the target page. Therefore,
>the page's right sibling becomes responsible for storing the key range
>that used to belong on the deleted page.
>
>What if there is no next-higher key, you ask? Well, I'm going to punt.
>It is critical that the key space associated with a parent page match the key
>space associated with its children (eg, the high key of the rightmost child
>must match the parent's high key). There is no way to atomically modify a
>parent's key space --- this would mean simultaneously changing keys in upper
>levels, perhaps all the way up to the root. To avoid needing to do that, we
>must put a couple of restrictions on deletion:
>
>1. The rightmost page in any tree level is never deleted, period. (This rule
>also simplifies traversal algorithms, as we'll see below.)
>
>2. The rightmost child of a non-rightmost parent page can't be deleted, either,
>unless it is the last child of that parent. If it is the last child then the
>parent is *immediately* marked as half-dead, meaning it can never acquire any
>new children; its key space implicitly transfers to its right sibling. (The
>parent can then be deleted in a later, separate atomic action. Note that if
>the parent is itself a rightmost child of a non-rightmost parent, it will have
>to stay in the half-dead state until it becomes the only child; then it can be
>deleted and its parent becomes half-dead.)
>
>(Note that while leaf pages can be empty and still alive, upper pages
>can't: they must have children to delegate their key range to.)
>
>With these restrictions, there is no need to alter the "high key" fields of
>either the parent or the siblings of a page being deleted. The key space of
>the page itself transfers to its right sibling, and the key space of the
>parent does not change (except in the case where the parent loses all its key
>space to its right sibling and becomes half-dead).
>
>Restriction 1 is not a significant one in terms of space wastage. Restriction
>2 is more annoying, but it should not increase overhead drastically. The
>parent would not be deleted or merged anyway as long as it has other children,
>so the worst-case overhead from this restriction is one extra page per
>parent page --- and parent levels are normally much smaller than child levels.
>
>The notion of a half-dead page means that the key space relationship between
>the half-dead page's level and its parent's level may be a little out of
>whack: key space that appears to belong to the half-dead page's parent on the
>parent level may really belong to its right sibling. We can tolerate this,
>however, because insertions and deletions on upper tree levels are always
>done by reference to child page numbers, not keys. The only cost is that
>searches may sometimes descend to the half-dead page and then have to move
>right, rather than going directly to the sibling page.
>
>
>Page deletion procedure
>-----------------------
>
>Assume we know the target page, but hold no locks. The locks must be
>gotten in this order to avoid deadlock against inserters:
>
>1. Obtain W lock on left sibling of target page (this may require searching
>right if left sibling splits concurrently; same as for backwards scan case).
>Skip this if target is leftmost.
>2. Obtain super-exclusive lock on target page; check it is still empty,
>else abandon deletion.
>3. Obtain W lock on right sibling (there must be one, else we can't delete).
>4. Find and W-lock the target's current parent page. Check that target is
>not rightmost child of a parent with other children, else abandon deletion.
>5. Update all four pages at once, write single WAL record describing all
>updates. (I believe we can extend WAL to allow 4 page references in a
>single record, if not it may be okay to cheat and not store all of the
>adjacent pages. Certainly need not record contents of the target page
>itself, so 3 is enough.) Note parent is marked half-dead here if this
>was its last child. Release locks.
>6. The update to the target page makes it empty and marked dead, but preserves
>its sibling links. It can't actually be recycled until later (see above).
>
>The deletion procedure could be triggered immediately upon removal of the
>last item in a page, or when the next VACUUM scan finds an empty page.
>Not sure yet which way is better.
>
>Having to exclusive-lock four pages is annoying, but I think we must do
>this procedure as a single atomic operation to ensure consistency.
>
>
>Search/scan rules in presence of deletion
>-----------------------------------------
>
>Since page deletion takes a superexclusive lock, a stopped scan will never
>find the page it's on deleted when it resumes. What we do have to worry about
>is arriving at a dead page after following a link. There are several cases:
>
>During initial search for a key, if arrive at a dead or half-dead page, chain
>right. (The key space must have moved to the right.)
>
>During forward scan: likewise chain right. (Note: because scans operate
>only on the leaf level, they should never see half-dead pages.)
>
>During backwards scan: chain right until a live page is found, and step left
>from there in the usual way. We cannot use the dead page's left-link because
>its left neighbor may have split since the page was deleted. (There is
>certain to be at least one live page to the right, since we never delete the
>rightmost page of a level.)
>Also, "step left in the usual way" used to mean this:
> A backwards scan has one additional bit of complexity: after
> following the left-link we must account for the possibility that the
> left sibling page got split before we could read it. So, we have to
> move right until we find a page whose right-link matches the page we
> came from.
>But if the "page we came from" is now dead, perhaps there is no
>page with a matching right-link (if updater got to it before we did)!
>Probably the best policy, if we fail to find a match, is to return to
>the page we were previously on, verify that it's now dead (else error),
>then chain right and left as in the case where we've linked to a dead page
>(see above). This means a backwards scan must always remember the last
>live page it was on. Again, this is greatly simplified by the fact that
>there must be a live page somewhere to the right.
>
>
>Insertion changes
>-----------------
>
>The basic insertion algorithm doesn't change (but see above notes about
>the search part).
>
>bt_getstackbuf is a little trickier in the presence of deletion. First
>issue is that removal of an earlier downlink in the returned-to parent
>page may cause the target item to move left by some number of slots
>(though not into a previous page). We can deal with this by searching
>left after we fail to find the item by searching right, before we move on
>to the next page. Next is that the whole parent page may be dead or
>half-dead --- but it can't be physically deleted, so we can recover by
>following its rightlink. The nasty part is that the target item could
>perhaps not be there; this would imply that someone deleted the page we
>are trying to split, or hasn't fully finished inserting it. But L&Y
>already require holding a lock on that page until we have re-located and
>locked the parent item, so this is not possible. (Note this implies that
>we must search for exactly the downlink to the page we are splitting, but
>that's true anyway.)
>
>If we need to split on a level that was the root when we descended, but
>is no longer the root, then our stack doesn't tell us how to find the next
>level up. As discussed above, handle this case by re-descending and
>checking level fields, rather than depending on parent links as before.
>We will simply descend on the left edge of the tree and scan the whole
>parent level to find where to insert --- it's highly unlikely that the
>parent level has yet grown large enough for this to be slow, so there's
>not much value in trying to use a keyed search.
>
>
>WAL issues
>----------
>
>A page split operation can't really be atomic. We can handle the actual split
>("half-split") operation as an atomic update:
>
>1. We have W lock on page that needs to be split.
>2. Find a free page (from FSM, or make it by extending rel), W-lock it.
>3. W-lock page's right sibling (if any), so that we can fix its left-link.
>4. Update all three pages, make WAL log entry describing same, write pages.
>5. We can now release the W-lock on the right sibling page, but we must keep
> the W-locks on the two split pages until we have made the parent entry;
> else it's conceivable for someone else to try to split or delete these
> pages and get confused because the parent link isn't in place yet.
> (See L&Y.)
>
>(Note that since this is driven by an insert, the half-split WAL log entry
>may as well include the new key insertion step; so this entry substitutes
>for the plain insertion WAL entry we'd otherwise make.)
>
>(So far this is the same as the existing code.)
>
>But now we must find the parent page and enter the new key and downlink
>(possibly causing a split at that level). If crash occurs before we can
>do so, how to recover? The existing code has a lot of ad-hoc logic that
>tries to reconstruct the tree on-the-fly when inconsistency is detected.
>But it would be a lot better if we could expect WAL playback to fix things.
>
>The WAL entry describing the half-split includes enough info to re-execute
>the parent key insertion. While replaying log, we can remember this
>insertion as pending. If we see a matching insertion event in the log,
>discard the remembered pending insertion. At end of log, execute all
>still-pending insertions. (There should not be very many pending
>insertions at any one time, so the resources for this are not onerous.)
>Note however that this implies making more log entries to describe these
>additional updates; might complicate WAL, but I see no fundamental
>difficulty.
>
>Also, if we are splitting a root page, it seems best to merge creation of
>the new root page and updating of the metapage into the same atomic action
>that does the split. This won't make the WAL entry materially larger, and
>it will avoid problems with concurrent root splits. Need to look at
>deadlock issues for trying to update metapage --- is it okay to grab W
>lock on meta while holding it on old root? (Sure, why not? Nothing will
>go the other way.) We do need to update metapage as part of the atomic
>root-split operation, else other updaters trying to find the new root may
>fail because link not there yet.
>
>
>VACUUM mechanics
>----------------
>
>We will institute an explicit "vacuuming an index" phase in VACUUM (this
>is distinct from deleting index entries during vacuuming of a table, and
>will be done after we've finished vacuuming the associated table). In
>this phase, we can scan the index linearly (ie, in storage order not
>index order) to take advantage of sequential I/O optimizations. We scan
>looking for pages that are empty (if leaf pages) or half-dead (if parent
>pages), as well as pages that are already dead and have been so long
>enough to be recycled. The latter are simply reported to the FSM.
>Empty and half-dead pages are deleted per the previously-described
>mechanism. (It seems best to gather the page IDs of target pages in
>memory, and do the deletions only after we've finished the seqscan, so
>as not to confuse the kernel about whether it should read-ahead.)
>
>One step that's not explicit above is how to find the parent of a page we
>intend to delete. It seems most efficient to perform a search for the
>page's high key (it must have one, since we don't delete rightmost pages)
>stopping when we reach the level above the page itself. This will give
>us a good pointer to the parent. We should do this before starting to
>acquire exclusive locks for the deletion.
>
>In theory, if we find recyclable page(s) at the physical end of the index,
>we could truncate the file (ie, give the space back to the filesystem)
>instead of reporting these pages to FSM. I am not sure if this is worth
>doing --- in most cases it's likely that little space can be released this
>way, and there may be some tricky locking issues.
>
>
>Tree depth
>----------
>
>Because we never relabel pages' levels, the tree depth cannot be reduced (we'd
>have to do that by removing the current root page, which seems impractical
>without an exclusive lock on the whole index). So after massive shrinkage
>we could end up with a "thin" tree in which there are levels below the root
>with only one page apiece. The space wastage doesn't seem like a big issue,
>but the extra time to traverse these useless levels could be annoying.
>
>This could be ignored in first implementation (there's always REINDEX).
>Later, possibly handle it via Lanin&Shasha's notion of a critic (think
>VACUUM) that sets a fast pointer to the current effective root level.
>(Actually I think we wouldn't need a separate critic process; split and
>delete steps could be programmed to update the fast pointer for
>themselves, in a separate atomic action, when they split a one-page level
>or delete the next-to-last page of a level.)
>
>
>Any comments before I plunge into implementing this?
>
> regards, tom lane
>
>---------------------------(end of broadcast)---------------------------
>TIP 3: if posting/reading through Usenet, please send an appropriate
>subscribe-nomail command to majordomo(at)postgresql(dot)org so that your
>message can get through to the mailing list cleanly
>
>
>

In response to

Brain dump: btree collapsing at 2003-02-12 22:42:44 from Tom Lane

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Christopher Kings-Lynne	2003-02-28 07:50:21	Are my emails getting through at all???
Previous Message	Christopher Kings-Lynne	2003-02-28 05:25:45	New format for psql (repost)