Re: Remove HeapTuple and Buffer dependency for predicate locking functions

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Ashwin Agrawal <aagrawal(at)pivotal(dot)io>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Kevin Grittner <kgrittn(at)gmail(dot)com>
Subject: Re: Remove HeapTuple and Buffer dependency for predicate locking functions
Date: 2019-08-06 04:20:05
Message-ID: CA+hUKGLF+iE3MVR6LtenbTu4Y_CYbxRZrVzQMSgdzu8stm2YKQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Aug 6, 2019 at 4:35 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> On 2019-08-05 20:58:05 +1200, Thomas Munro wrote:
> > 1. Commit dafaa3efb75 "Implement genuine serializable isolation
> > level." (2011) locked the root tuple, because it set t_self to *tid.
> > Possibly due to confusion about the effect of the preceding line
> > ItemPointerSetOffsetNumber(tid, offnum).
> >
> > 2. Commit 81fbbfe3352 "Fix bugs in SSI tuple locking." (2013)
> > fixed that by adding ItemPointerSetOffsetNumber(&heapTuple->t_self,
> > offnum).
>
> Hm. I'm not at all sure that it's ok to report the non-root tuple tid
> here. I.e. I'm fairly sure there was a reason to not just set it to the
> actual tid. I think I might have written that up on the list at some
> point. Let me dig through memory and the list. It's obviously possible
> that that was also obsoleted by parallel changes.

Adding Heikki and Kevin.

I haven't found your earlier discussion about that yet, but would be
keen to read it if you can find it. I wondered whether your argument
might have had something to do with heap pruning, but I can't find a
problem there. It's not as though the TIDs of any visible tuples
change; it's just that some dead tuples go away and the root line
pointer is changed to LP_REDIRECT if it wasn't already.

As for the argument for locking the tuple we emit rather than the HOT
root, I think the questions are: (1) How exactly do we get away with
locking only one version in a regular non-HOT update chain? (2) Is it
OK to do that with a HOT root?

The answer to the first question is given in README-SSI[1].
Unfortunately it doesn't discuss HOT directly, but I suspect the
answer to the second question is no: HOT is not special here. By my
reading, the argument relies on the version you lock being the version
visible to your snapshot, which is important because later updates
have to touch that tuple to write the next version. That doesn't
apply to some arbitrarily older tuple that happens to be the HOT root.
Concretely, heap_update() does
CheckForSerializableConflictIn(relation, &oldtup, buffer), which is
only going to produce a rw-conflict if T1 took an SIREAD lock on
precisely the version T2 is updating in that path, not on some
arbitrarily older version that happens to be the HOT root. A HOT root
might never be considered again by concurrent writers, no?
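
To make that tag-matching point concrete, here is a toy, self-contained
C model (my own sketch, not PostgreSQL code; the real machinery in
predicate.c keys tuple-level SIREAD locks by database/relation/page/
offset). It only shows that the writer's conflict-in check fires on an
exact TID match, so recording the read-side lock on the HOT root rather
than on the version actually read silently misses the rw-conflict:

/*
 * Toy model, not PostgreSQL code: SIREAD locks on tuples are keyed by
 * (block, offset).  The writer's conflict-in check only fires when the
 * TID it is updating matches a recorded lock exactly.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct TupleTag
{
    unsigned    block;
    unsigned    offset;
} TupleTag;

#define MAX_LOCKS 16
static TupleTag siread_locks[MAX_LOCKS];
static int  nlocks = 0;

/* Reader side: record an SIREAD lock on the chosen TID. */
static void
predicate_lock_tuple(TupleTag tag)
{
    siread_locks[nlocks++] = tag;
}

/* Writer side: is there an SIREAD lock on the TID being updated? */
static bool
check_for_conflict_in(TupleTag tag)
{
    for (int i = 0; i < nlocks; i++)
        if (siread_locks[i].block == tag.block &&
            siread_locks[i].offset == tag.offset)
            return true;
    return false;
}

int
main(void)
{
    /* HOT chain on block 0: root at offset 1, visible version at offset 3. */
    TupleTag    root = {0, 1};
    TupleTag    visible = {0, 3};

    /* T1's index scan returns the visible version but locks the root. */
    predicate_lock_tuple(root);

    /* T2's heap_update() checks the TID of the version it is updating. */
    printf("conflict seen when root is locked:    %d\n",
           check_for_conflict_in(visible));    /* 0 -- rw-conflict missed */

    /* With the proposed change, T1 locks the version it actually read. */
    nlocks = 0;
    predicate_lock_tuple(visible);
    printf("conflict seen when visible is locked: %d\n",
           check_for_conflict_in(visible));    /* 1 -- rw-conflict detected */

    return 0;
}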

As a minor consequence, the optimisation in
CheckTargetForConflictsIn() assumes that the tuple being updated has
the same tag as the one we locked when reading it. That isn't the
case if we locked the root while reading but now hold the TID of the
version we actually read, so in master we leak a tuple lock
unnecessarily until end-of-transaction whenever we update a HOT tuple.
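
Continuing the toy model above (again my sketch, not the actual
predicate.c code), the leak looks like this: the writer only releases
its own SIREAD lock when the written TID matches the locked TID
exactly, so a lock taken on the root is never cleaned up by the update:

/*
 * Toy model of the cleanup described above, not PostgreSQL code: when a
 * transaction writes a tuple it has itself predicate-locked, its own
 * SIREAD lock on that exact target can be released early.  If the
 * read-side lock was taken on the HOT root instead, the tags never
 * match and the lock lingers until end of transaction.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct TupleTag
{
    unsigned    block;
    unsigned    offset;
} TupleTag;

static TupleTag my_siread_lock;     /* one lock held by this transaction */
static bool holding_lock = false;

static void
predicate_lock_tuple(TupleTag tag)
{
    my_siread_lock = tag;
    holding_lock = true;
}

/* On write, drop our own SIREAD lock only if the tags match exactly. */
static void
check_target_for_conflicts_in(TupleTag written)
{
    if (holding_lock &&
        my_siread_lock.block == written.block &&
        my_siread_lock.offset == written.offset)
        holding_lock = false;
}

int
main(void)
{
    TupleTag    root = {0, 1};
    TupleTag    visible = {0, 3};

    /* master: the read locks the root, then we update the visible version */
    predicate_lock_tuple(root);
    check_target_for_conflicts_in(visible);
    printf("lock leaked (root locked):    %d\n", holding_lock);   /* 1 */

    /* proposed: the read locks the version actually read */
    predicate_lock_tuple(visible);
    check_target_for_conflicts_in(visible);
    printf("lock leaked (visible locked): %d\n", holding_lock);   /* 0 */

    return 0;
}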

> > This must be in want of an isolation test, but I haven't yet tried to
> > get my head around how to write a test that would show the difference.
>
> Indeed.

One practical problem is that the only way to reach
PredicateLockTuple() is from an index scan, and the index scan locks
the index page (or the whole index, depending on
rd_indam->ampredlocks). So I think that to see a serialization
anomaly you'll need multiple indexes (so that index page locks don't
hide the problem), a long enough HOT chain, and then probably several
transactions, so that a cycle that should be picked up by the logic
in [1] is missed. I'm out of steam for this problem today, though.

The simple test from the report[3] that resulted in commit 81fbbfe3352
doesn't work for me (i.e. with permutation "r1" "r2" "w1" "w2" "c1" "c2"
twice in a row). The best I've come up with so far is an assertion
that we predicate-lock the same row version that we emitted to the
user when it is reached via an index lookup that visits a HOT row. The
test outputs 'f' on master, but 't' with the change to heapam.c.

[1] Excerpt from README-SSI:

===
* PostgreSQL does not use "update in place" with a rollback log
for its MVCC implementation. Where possible it uses "HOT" updates on
the same page (if there is room and no indexed value is changed).
For non-HOT updates the old tuple is expired in place and a new tuple
is inserted at a new location. Because of this difference, a tuple
lock in PostgreSQL doesn't automatically lock any other versions of a
row. We don't try to copy or expand a tuple lock to any other
versions of the row, based on the following proof that any additional
serialization failures we would get from that would be false
positives:

o If transaction T1 reads a row version (thus acquiring a
predicate lock on it) and a second transaction T2 updates that row
version (thus creating a rw-conflict graph edge from T1 to T2), must a
third transaction T3 which re-updates the new version of the row also
have a rw-conflict in from T1 to prevent anomalies? In other words,
does it matter whether we recognize the edge T1 -> T3?

o If T1 has a conflict in, it certainly doesn't. Adding the
edge T1 -> T3 would create a dangerous structure, but we already had
one from the edge T1 -> T2, so we would have aborted something anyway.
(T2 has already committed, else T3 could not have updated its output;
but we would have aborted either T1 or T1's predecessor(s). Hence
no cycle involving T1 and T3 can survive.)

o Now let's consider the case where T1 doesn't have a
rw-conflict in. If that's the case, for this edge T1 -> T3 to make a
difference, T3 must have a rw-conflict out that induces a cycle in the
dependency graph, i.e. a conflict out to some transaction preceding T1
in the graph. (A conflict out to T1 itself would be problematic too,
but that would mean T1 has a conflict in, the case we already
eliminated.)

o So now we're trying to figure out if there can be an
rw-conflict edge T3 -> T0, where T0 is some transaction that precedes
T1. For T0 to precede T1, there has to be some edge, or sequence of
edges, from T0 to T1. At least the last edge has to be a wr-dependency
or ww-dependency rather than a rw-conflict, because T1 doesn't have a
rw-conflict in. And that gives us enough information about the order
of transactions to see that T3 can't have a rw-conflict to T0:
- T0 committed before T1 started (the wr/ww-dependency implies this)
- T1 started before T2 committed (the T1->T2 rw-conflict implies this)
- T2 committed before T3 started (otherwise, T3 would get aborted
because of an update conflict)

o That means T0 committed before T3 started, and therefore
there can't be a rw-conflict from T3 to T0.

o So in all cases, we don't need the T1 -> T3 edge to
recognize cycles. Therefore it's not necessary for T1's SIREAD lock
on the original tuple version to cover later versions as well.
===

[2] https://www.postgresql.org/message-id/52527E4D.4060302%40vmware.com
[3] https://www.postgresql.org/message-id/flat/523C29A8.20904%40vmware.com

--
Thomas Munro
https://enterprisedb.com

Attachment: 0001-Predicate-lock-the-visible-heap-tuple-not-the-HOT-ro.patch
(application/octet-stream, 4.5 KB)
