Re: snapshot too old, configured by time

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Kevin Grittner <kgrittn(at)gmail(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, Kevin Grittner <kgrittn(at)ymail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: snapshot too old, configured by time
Date: 2016-04-05 02:15:20
Message-ID: CAM3SWZQdSPVsVL_jjjKVQ0kd8OskLPZFzXFrt-DmzSOgFuRtew@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Mar 30, 2016 at 12:46 PM, Kevin Grittner <kgrittn(at)gmail(dot)com> wrote:
> On Wed, Mar 30, 2016 at 2:29 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
>
>> [Does the patch allow dangling page pointers?]
>
>> Again, I don't want to prejudice anyone against your patch, which I
>> haven't read.
>
> I don't believe that the way the patch does its business opens any
> new vulnerabilities of this type. If you see such after looking at
> the patch, let me know.

Okay, let me be more concrete about this. The patch does this:

> --- a/src/backend/access/heap/pruneheap.c
> +++ b/src/backend/access/heap/pruneheap.c
> @@ -92,12 +92,21 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
> * need to use the horizon that includes slots, otherwise the data-only
> * horizon can be used. Note that the toast relation of user defined
> * relations are *not* considered catalog relations.
> + *
> + * It is OK to apply the old snapshot limit before acquiring the cleanup
> + * lock because the worst that can happen is that we are not quite as
> + * aggressive about the cleanup (by however many transaction IDs are
> + * consumed between this point and acquiring the lock). This allows us to
> + * save significant overhead in the case where the page is found not to be
> + * prunable.
> */
> if (IsCatalogRelation(relation) ||
> RelationIsAccessibleInLogicalDecoding(relation))
> OldestXmin = RecentGlobalXmin;
> else
> - OldestXmin = RecentGlobalDataXmin;
> + OldestXmin =
> + TransactionIdLimitedForOldSnapshots(RecentGlobalDataXmin,
> + relation);

This new intermediary function TransactionIdLimitedForOldSnapshots()
is called to decide what OldestXmin actually gets to be above, based
in part on the new GUC old_snapshot_threshold:

> +/*
> + * TransactionIdLimitedForOldSnapshots
> + *
> + * Apply old snapshot limit, if any. This is intended to be called for page
> + * pruning and table vacuuming, to allow old_snapshot_threshold to override
> + * the normal global xmin value. Actual testing for snapshot too old will be
> + * based on whether a snapshot timestamp is prior to the threshold timestamp
> + * set in this function.
> + */
> +TransactionId
> +TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
> + Relation relation)

It might not be RecentGlobalDataXmin that is usually returned as
OldestXmin as it is today, which is exactly the point of this patch:
VACUUM can be more aggressive in cleaning up bloat, not unlike the
non-catalog logical decoding case, on the theory that we can reliably
detect when that causes failures for old snapshots, and just raise a
"snapshot too old" error. (RecentGlobalDataXmin is morally about the
same as RecentGlobalXmin, as far as this patch goes).

So far, so good. It's okay that _bt_page_recyclable() never got the
memo about any of this...:

/*
* _bt_page_recyclable() -- Is an existing page recyclable?
*
* This exists to make sure _bt_getbuf and btvacuumscan have the same
* policy about whether a page is safe to re-use.
*/
bool
_bt_page_recyclable(Page page)
{
BTPageOpaque opaque;

...

/*
* Otherwise, recycle if deleted and too old to have any processes
* interested in it.
*/
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
if (P_ISDELETED(opaque) &&
TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin))
return true;
return false;
}

...because this patch does nothing to advance RecentGlobalXmin (or
RecentGlobalDataXmin) itself more aggressively. It does make
vacuum_set_xid_limits() get a more aggressive cutoff point, but we
don't see that being passed back down by lazy vacuum here; within
_bt_page_recyclable(), we rely on the more conservative
RecentGlobalXmin, which is not subject to any clever optimization in
the patch.

Fortunately, this seems correct, since index scans will always succeed
in finding a deleted page, per nbtree README notes on
RecentGlobalXmin. Unfortunately, this does stop recycling from
happening early for B-Tree pages, even though that's probably safe in
principle. This is probably not so bad -- it just needs to be
considered when reviewing this patch (the same is true of logical
decoding's RecentGlobalDataXmin; it also doesn't appear in
_bt_page_recyclable(), and I guess that that was never a problem).
Index relations will not get smaller in some important cases, but they
will be made less bloated by VACUUM in a sense that's still probably
very useful. Maybe that explains some of what Jeff talked about.

I think another part of the problems that Jeff mentioned (with
pruning) could be this existing code within heap_hot_search_buffer():

/*
* If we can't see it, maybe no one else can either. At caller
* request, check whether all chain members are dead to all
* transactions.
*/
if (all_dead && *all_dead &&
!HeapTupleIsSurelyDead(heapTuple, RecentGlobalXmin))
*all_dead = false;

This is used within routines like btgettuple(), to do the LP_DEAD
thing to kill index tuples (not HOT chain pruning).

Aside: Not sure offhand why it might be okay, performance-wise, that
this code doesn't care about RecentGlobalDataXmin; pruning was a big
part of why RecentGlobalDataXmin was added for logical decoding, I
thought, although I guess the _bt_killitems() stuff doesn't count as
pruning.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Abhijit Menon-Sen 2016-04-05 02:25:30 Re: dealing with extension dependencies that aren't quite 'e'
Previous Message Kyotaro HORIGUCHI 2016-04-05 02:03:38 Re: postgres_fdw : altering foreign table not invalidating prepare statement execution plan.