Re: Bugs in b-tree dead page removal

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Bugs in b-tree dead page removal
Date: 2010-02-08 12:01:22
Message-ID: 4B6FFD12.7060908@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Tom Lane wrote:
> Whilst looking around for stuff that could be deleted thanks to removing
> old-style VACUUM FULL, I came across some code in btree that seems
> rather seriously buggy. For reasons explained in nbtree/README, we
> can't physically recycle a "deleted" btree index page until all
> transactions open at the time of deletion are gone --- otherwise we
> might re-use a page that an existing scan is about to land on, and
> confuse that scan. (This condition is overly strong, of course, but
> it's what's designed in at the moment.) The way this is implemented
> is to label a freshly-deleted page with the current value of
> ReadNewTransactionId(). Once that value is older than RecentXmin,
> the page is presumed recyclable.
>
> I think this was all right when it was designed, but isn't it rather
> badly broken by our subsequent changes to have transactions not take
> out an XID until/unless they write something? A read-only transaction
> could easily be much older than RecentXmin, no?

Yeah :-(.

> The odds of an actual problem seem not very high, since to be affected
> a scan would have to be already "in flight" to the problem page when
> the deletion occurs. By the time RecentXmin advances and we feed the
> page to the FSM and get it back, the scan's almost surely going to have
> arrived. And I think the logic is such that this would not happen
> before the next VACUUM in any case. Still, it seems pretty bogus.

One idea is to change the traversal logic slightly, so that whenever you
follow a pointer from page A to B, you keep A pinned until you've pinned
B. Then we could just take cleanup lock on the left, right and parent of
the empty page, one at a time, to ensure that no-one is in-flight to it,
and recycle it immediately. However, forward scans screw that up. When a
forward scan reads a page, it saves its right pointer at the same time,
so that if the page is subsequently split, it doesn't process tuples on
the new right half again. Even if we take cleanup lock on the left
sibling of an empty page, there can be scans further left that have a
direct pointer to the empty page. It could be salvaged if we require a
forward scan to pin the next page too when it saves the right pointer,
but that seems inefficient.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Magnus Hagander 2010-02-08 13:15:32 Re: Pathological regexp match
Previous Message Massa, Harald Armin 2010-02-08 11:55:11 Re: Confusion over Python drivers