Re: snapshot too old issues, first around wraparound and then more.

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Kevin Grittner <kgrittn(at)gmail(dot)com>
Subject: Re: snapshot too old issues, first around wraparound and then more.
Date: 2020-04-03 00:12:35
Message-ID: 20200403001235.e6jfdll3gh2ygbuc@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

I just spend a good bit more time improving my snapshot patch, so it
could work well with a fixed version of the old_snapshot_threshold
feature. Mostly so there's no unnecessary dependency on the resolution
of the issues in that patch.

When testing my changes, for quite a while, I could not get
src/test/modules/snapshot_too_old/ to trigger a single too-old error.

It turns out, that's because there's not a single tuple removed due to
old_snapshot_threshold in src/test/modules/snapshot_too_old/. The only
reason the current code triggers any such errors is that

a) TransactionIdLimitedForOldSnapshots() is always called in
heap_page_prune_opt(), even if the "not limited" horizon
(i.e. RecentGlobalDataXmin) is more than old enough to allow for
pruning. That includes pages without a pd_prune_xid.

b) TransactionIdLimitedForOldSnapshots(), in the old_snapshot_threshold
== 0 branch, always calls
SetOldSnapshotThresholdTimestamp(ts, xlimit)
even if the horizon has not changed due to snapshot_too_old (xlimit
is initially set tot the "input" horizon, and only increased if
between (recentXmin, MyProc->xmin)).

To benefit from the snapshot scalability improvements in my patchset, it
is important to avoid unnecessarily determining the "accurate" xmin
horizon, if it's clear from the "lower boundary" horizon that pruning
can happen. Therefore I changed heap_page_prune_opt() and
heap_page_prune() to only limit when we couldn't prune.

In the course of that I separated getting the horizon from
TransactionIdLimitedForOldSnapshots() and triggering errors when an
already removed tuple would be needed via
TransactionIdLimitedForOldSnapshots().

Because there are no occasions to actually remove tuples in the entire
test, there now were no TransactionIdLimitedForOldSnapshots() calls. And
thus no errors. My code turns out to actually work.

Thus, if I change the code in master from:
TransactionId xlimit = recentXmin;
...
if (old_snapshot_threshold == 0)
{
if (TransactionIdPrecedes(latest_xmin, MyPgXact->xmin)
&& TransactionIdFollows(latest_xmin, xlimit))
xlimit = latest_xmin;

ts -= 5 * USECS_PER_SEC;
SetOldSnapshotThresholdTimestamp(ts, xlimit);

return xlimit;
}

to
...
if (old_snapshot_threshold == 0)
{
if (TransactionIdPrecedes(latest_xmin, MyPgXact->xmin)
&& TransactionIdFollows(latest_xmin, xlimit))
{
xlimit = latest_xmin;
SetOldSnapshotThresholdTimestamp(ts, xlimit);
}

ts -= 5 * USECS_PER_SEC;

return xlimit;
}

there's not a single error raised in the existing tests. Not a *single*
tuple removal is caused by old_snapshot_threshold. We just test the
order of SetOldSnapshotThresholdTimestamp() calls. We have code in the
backend to support testing old_snapshot_threshold, but we don't test
anything meaningful in the feature. We basically test a oddly behaving
version version of "transaction_timeout = 5s". I can't emphasize enough
how baffling I find this.

Greetings,

Andres Freund

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2020-04-03 00:17:13 Re: snapshot too old issues, first around wraparound and then more.
Previous Message Tom Lane 2020-04-02 23:44:34 Re: Berserk Autovacuum (let's save next Mandrill)