

From: yamt(at)mwd(dot)biglobe(dot)ne(dot)jp (YAMAMOTO Takashi)
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: teodor(at)sigaev(dot)ru, oleg(at)sai(dot)msu(dot)su, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: VACUUM in SP-GiST
Date: 2012-01-16 21:43:35
Message-ID:
Lists: pgsql-hackers

> I started reading the spgist vacuum code last night, and didn't like it
> at all.  I found a number of smaller issues, but it seems to me that
> the design is just fundamentally wrong.  Unless I'm misunderstanding it,
> the approach is to recursively traverse the tree in sort of the same way
> that a full-index search would do, except that it continues to hold lock
> on the immediate parent page of whichever page is currently being
> visited, so that it can update the downlink if it has to delete the
> first leaf tuple of a chain.  I've got several complaints about that:
> 1. Holding exclusive locks on upper pages while we visit
> possibly-hundreds of leaf pages seems pretty bad from a concurrency
> standpoint.  It doesn't hold locks all the way up to the root, so it's
> not a complete loss, but even a first-level inner page can have a lot of
> children.
> 2. I do not trust this code to visit every leaf page (which, if it
> failed to do so and thereby missed deleting a heap reference, would
> result in a silently corrupt index).  Unlike a regular index search,
> which makes a list of lower-level targets while examining an inner tuple
> just once, this code depends on the assumption that it can reacquire
> lock on an upper page and then resynchronize itself with whatever's been
> done meanwhile to the inner tuple it was looking at.  I don't trust
> that.  The mechanism that's being used relies on page LSN having been
> changed if anything was changed on the page, which does not work in
> unlogged tables.  Furthermore, if the inner tuple did change, it has to
> rescan *everything* that the inner tuple leads to.  That's horrendously
> inefficient, and it's not even clear that it would ever terminate in the
> face of a constant stream of concurrent updates.
> 3. Even if it all worked, it would be slow as can be, because it
> requires a whole lot of nonsequential accesses to the index.  For
> instance, the same leaf page might be visited many times (once for
> each leaf chain on the page), and not necessarily close together either.
> Also, the code doesn't seem to perform any cleanup at all on inner
> pages.  I am not expecting it to try to get rid of childless inner
> tuples, but surely it ought to at least convert old redirects to dead
> and get rid of dead tuples at the end of the page, much as for leaf
> pages?
> What I'm envisioning doing instead is making a single sequential scan
> over the entire index.  On both leaf and inner pages we could clean up
> redirects and remove dead tuples at the end of the page.  On leaf pages
> we'd also check for
> tuples to be deleted.  There's no real difficulty in removing deleted
> tuples as long as they are not at the heads of their chains.  (The code
> would have to reverse-engineer which tuples are at the heads of their
> chains, but that's easily done while scanning the page; we just make a
> reverse-lookup map for the nextOffset pointers.)  If we have to delete a
> tuple at the head of its chain, but there's still at least one live
> tuple in its chain, we could reshuffle the tuples to bring another live
> tuple to that same offset number, thereby not invalidating the parent
> downlink.  The only hard part is what to do when we delete all members
> of a chain: we can't reset the parent downlink because we do not know
> where it is.  What I propose to do in that case is install a variant
> form of dead tuple that just indicates "this chain is empty".  One way
> to represent that would be a redirect tuple pointing nowhere, but I
> think it would be cleaner to repurpose the isDead and isRedirect bits as
> a two-bit field with four states.  We'd have LIVE, DEAD, REDIRECT, and
> this fourth state for a dead tuple that is not recyclable because we
> know there's a link to it somewhere.  A scan would ignore such a tuple.
> An insertion that arrives at such a tuple could either replace it (if
> there's enough room on the page) or convert it to a redirect (if not).
> Last night I was thinking the fourth state could be named TOMBSTONE,
> but maybe it'd be better to use DEAD for this case and rename the
> existing "removable dead tuple" state to PLACEHOLDER, since that case
> only exists to preserve the offset numbers of other tuples on the
> page.
> Comments?

with a concurrent split moving leaf tuples between pages, can't bulkdelete
fail to report some TIDs, and couldn't that make CREATE INDEX CONCURRENTLY
create duplicates?


> 			regards, tom lane
