Re: VACUUM in SP-GiST

From: yamt(at)mwd(dot)biglobe(dot)ne(dot)jp (YAMAMOTO Takashi)
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: teodor(at)sigaev(dot)ru, oleg(at)sai(dot)msu(dot)su, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: VACUUM in SP-GiST
Date: 2012-01-16 21:43:35
Message-ID: 20120116214335.D88C414A22B@mail.netbsd.org
Lists: pgsql-hackers

hi,

> I started reading the spgist vacuum code last night, and didn't like it
> at all. I found a number of smaller issues, but it seems to me that
> the design is just fundamentally wrong. Unless I'm misunderstanding it,
> the approach is to recursively traverse the tree in sort of the same way
> that a full-index search would do, except that it continues to hold lock
> on the immediate parent page of whichever page is currently being
> visited, so that it can update the downlink if it has to delete the
> first leaf tuple of a chain. I've got several complaints about that:
>
> 1. Holding exclusive locks on upper pages while we visit
> possibly-hundreds of leaf pages seems pretty bad from a concurrency
> standpoint. It doesn't hold locks all the way up to the root, so it's
> not a complete loss, but even a first-level inner page can have a lot of
> children.
>
> 2. I do not trust this code to visit every leaf page (which, if it
> failed to do so and thereby missed deleting a heap reference, would
> result in a silently corrupt index). Unlike a regular index search,
> which makes a list of lower-level targets while examining an inner tuple
> just once, this code depends on the assumption that it can reacquire
> lock on an upper page and then resynchronize itself with whatever's been
> done meanwhile to the inner tuple it was looking at. I don't trust
> that. The mechanism that's being used relies on page LSN having been
> changed if anything was changed on the page, which does not work in
> unlogged tables. Furthermore, if the inner tuple did change, it has to
> rescan *everything* that the inner tuple leads to. That's horrendously
> inefficient, and it's not even clear that it would ever terminate in the
> face of a constant stream of concurrent updates.
>
> 3. Even if it all worked, it would be slow as can be, because it
> requires a whole lot of nonsequential accesses to the index. For
> instance, the same leaf page might be visited many times (once for
> each leaf chain on the page), and not necessarily close together either.
>
> Also, the code doesn't seem to perform any cleanup at all on inner
> pages. I am not expecting it to try to get rid of childless inner
> tuples, but surely it ought to at least convert old redirects to dead
> and get rid of dead tuples at the end of the page, much as for leaf
> pages?
>
> What I'm envisioning doing instead is making a single sequential scan
> over the entire index. On both leaf and inner pages we could clean
> redirects and remove dead tuples at the end of the page. On leaf pages
> we'd also check for
> tuples to be deleted. There's no real difficulty in removing deleted
> tuples as long as they are not at the heads of their chains. (The code
> would have to reverse-engineer which tuples are at the heads of their
> chains, but that's easily done while scanning the page; we just make a
> reverse-lookup map for the nextOffset pointers.) If we have to delete a
> tuple at the head of its chain, but there's still at least one live
> tuple in its chain, we could reshuffle the tuples to bring another live
> tuple to that same offset number, thereby not invalidating the parent
> downlink. The only hard part is what to do when we delete all members
> of a chain: we can't reset the parent downlink because we do not know
> where it is. What I propose to do in that case is install a variant
> form of dead tuple that just indicates "this chain is empty". One way
> to represent that would be a redirect tuple pointing nowhere, but I
> think it would be cleaner to repurpose the isDead and isRedirect bits as
> a two-bit field with four states. We'd have LIVE, DEAD, REDIRECT, and
> this fourth state for a dead tuple that is not recyclable because we
> know there's a link to it somewhere. A scan would ignore such a tuple.
> An insertion that arrives at such a tuple could either replace it (if
> there's enough room on the page) or convert it to a redirect (if not).
>
> Last night I was thinking the fourth state could be named TOMBSTONE,
> but maybe it'd be better to use DEAD for this case and rename the
> existing "removable dead tuple" state to PLACEHOLDER, since that case
> only exists to preserve the offset numbers of other tuples on the
> page.
>
> Comments?

with a concurrent split moving leaf tuples between pages, can't that
cause bulkdelete to fail to report some TIDs, and thus make
CREATE INDEX CONCURRENTLY create duplicates?

YAMAMOTO Takashi

>
> regards, tom lane
>
