Re: On-the-fly index tuple deletion vs. hot_standby

From: Noah Misch <noah(at)leadboat(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: On-the-fly index tuple deletion vs. hot_standby
Date: 2011-03-12 10:40:39
Message-ID: 20110312104039.GA21992@tornado.gateway.2wire.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Dec 09, 2010 at 09:48:25AM +0000, Simon Riggs wrote:
> On Fri, 2010-12-03 at 21:43 +0200, Heikki Linnakangas wrote:
> > On 29.11.2010 08:10, Noah Misch wrote:
> > > I have a hot_standby system and use it to bear the load of various reporting
> > > queries that take 15-60 minutes each. In an effort to avoid long pauses in
> > > recovery, I set a vacuum_defer_cleanup_age constituting roughly three hours of
> > > the master's transactions. Even so, I kept seeing recovery pause for the
> > > duration of a long-running query. In each case, the culprit record was an
> > > XLOG_BTREE_DELETE arising from on-the-fly deletion of an index tuple. The
> > > attached test script demonstrates the behavior (on HEAD); the index tuple
> > > reclamation conflicts with a concurrent "SELECT pg_sleep(600)" on the standby.
> > >
> > > Since this inserting transaction aborts, HeapTupleSatisfiesVacuum reports
> > > HEAPTUPLE_DEAD independent of vacuum_defer_cleanup_age. We go ahead and remove
> > > the index tuples. On the standby, btree_xlog_delete_get_latestRemovedXid does
> > > not regard the inserting-transaction outcome, so btree_redo proceeds to conflict
> > > with snapshots having visibility over that transaction. Could we correctly
> > > improve this by teaching btree_xlog_delete_get_latestRemovedXid to ignore tuples
> > > of aborted transactions and tuples inserted and deleted within one transaction?
>
> @Noah Easily the best bug reported submitted in a long time. Thanks.
>
> > Seems reasonable. HeapTupleHeaderAdvanceLatestRemovedXid() will need
> > similar treatment. Actually, btree_xlog_delete_get_latestRemovedXid()
> > could just call HeapTupleHeaderAdvanceLatestRemoveXid().
>
> Yes, it applies to other cases also. Thanks for the suggestion.
>
> Fix committed. Please double-check my work, committed early since I'm
> about to jump on a plane.

The installation that inspired my original report recently upgraded from 9.0.1
to 9.0.3, and your fix did significantly decrease its conflict frequency. The
last several conflicts I have captured involve XLOG_BTREE_REUSE_PAGE records.
(FWIW, the index has generally been pg_attribute_relid_attnam_index.) I've
attached a test script demonstrating the behavior. _bt_page_recyclable approves
any page deleted no more recently than RecentXmin, because we need only ensure
that every ongoing scan has witnessed the page as dead. For the hot standby
case, we need to account for possibly-ongoing standby transactions. Using
RecentGlobalXmin covers that, albeit with some pessimism: we really only need
LEAST(RecentXmin, PGPROC->xmin of walsender_1, .., PGPROC->xmin of walsender_N)
- vacuum_defer_cleanup_age. Not sure the accounting to achieve that would pay
off, though. Thoughts?

Thanks,
nm

Attachment Content-Type Size
repro-btree-reuse.sh application/x-sh 1.8 KB
standby-bt-reuse.patch text/plain 894 bytes

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Farina 2011-03-12 11:59:53 Re: Fwd: index corruption in PG 8.3.13
Previous Message Dave Page 2011-03-12 09:48:02 Re: template0 database comment