
Re: foreign key locks, 2nd attempt

From: Noah Misch <noah(at)leadboat(dot)com>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2011-12-04 12:20:27
Message-ID: 20111204122027.GA10035@tornado.leadboat.com
Lists: pgsql-hackers
Hi Alvaro,

On Thu, Nov 03, 2011 at 03:12:49PM -0300, Alvaro Herrera wrote:
> After some rather extensive rewriting, I submit the patch to improve
> foreign key locks.

I've reviewed this patch.  The basic design and behaviors are sound.  All the
bugs noted in my previous review are gone.

Making pg_multixact persistent across clean shutdowns is no bridge to cross
lightly, since it means committing to an on-disk format for an indefinite
period.  We should do it; the benefits of this patch justify it, and I haven't
identified a way to avoid it without incurring worse problems.

FWIW, I pondered a dead-end alternate idea of having every MultiXactId also be
a SubTransactionId.  That way, you could still truncate pg_multixact early,
with the subtransaction commit status being adequate going forward.  However,
for the case when the locker arrives after the updater, this would require the
ability to create a new subtransaction on behalf of a different backend.  It
would also burn the xid space more quickly, slow commits affected by the added
subtransaction load, and aggravate "suboverflowed" incidence.


I did some halfhearted benchmarking to at least ensure the absence of any
gross performance loss on at-risk operations.  Benchmarks done with a vanilla
build, -O2, no --enable-cassert.  First, to exercise the cost of comparing
large column values an extra time, I created a table with a 2000-byte key
column and another int4 column.  I then did a HOT update of every tuple.  The
patch did not significantly change runtime.  See attached fklock-wide.sql for
the commands run and timings collected.

Second, I tried a SELECT FOR SHARE on a table of 1M tuples; this might incur
some cost due to the now-guaranteed use of pg_multixact for FOR SHARE.  See
attached fklock-test-forshare.sql.  The median run slowed by 7% under the
patch, albeit with a rather brief benchmark run.  Both master and patched
PostgreSQL seemed to exhibit a statement-scope memory leak in this test case:
to lock 1M rows, backend-private memory grew by about 500 MiB.  When trying 10M
rows, I cancelled the query after it had consumed 1.2 GiB.  This limited the
duration of a convenient test run.

I planned to benchmark the overhead of the HeapTupleSatisfiesMVCC() changes
when no foreign keys are in use, but I did not get around to that.


For anyone else following along, here are some important past threads:
http://archives.postgresql.org/message-id/1294953201-sup-2099@alvh.no-ip.org
http://archives.postgresql.org/message-id/20110211071322.GB26971@tornado.leadboat.com
http://archives.postgresql.org/message-id/1312907125-sup-9346@alvh.no-ip.org
http://archives.postgresql.org/message-id/cmdap.323308e530.1315601945-sup-7377@alvh.no-ip.org
http://archives.postgresql.org/message-id/1317053656-sup-7193@alvh.no-ip.org
http://archives.postgresql.org/message-id/1317840445-sup-7142@alvh.no-ip.org

> So Noah Misch proposed using the FOR KEY SHARE syntax, and that's what I
> have implemented here.  (There was some discussion that instead of
> inventing new SQL syntax we could pass the necessary lock mode 
> internally in the ri_triggers code.  That can still be done of course, 
> though I haven't done so in the current version of the patch.)

From a UI perspective, I'd somewhat rather we exposed not only FOR KEY SHARE,
but also FOR KEY UPDATE, in the grammar.  That lets us document the tuple lock
conflict table entirely in terms of other documented tuple locks.
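For reference, here is the conflict table I have in mind, as a sketch.  The
names and the exact matrix are my reading of the patch's intended semantics,
not code lifted from it:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical numbering, weakest to strongest; names follow the patch. */
typedef enum
{
	LOCKTUPLE_KEY_SHARE,	/* FOR KEY SHARE: FK checks */
	LOCKTUPLE_SHARE,		/* FOR SHARE */
	LOCKTUPLE_UPDATE,		/* non-key UPDATE, FOR UPDATE */
	LOCKTUPLE_KEY_UPDATE	/* FOR KEY UPDATE: DELETE, key-column UPDATE */
} TupleLockMode;

/*
 * The point of the patch: KEY SHARE conflicts only with KEY UPDATE, so an
 * FK check can coexist with an update that leaves key columns intact.
 */
static const bool tuplock_conflicts[4][4] = {
	/*               KS     SHR    UPD    KEYUPD */
	/* KS     */   { false, false, false, true  },
	/* SHR    */   { false, false, true,  true  },
	/* UPD    */   { false, true,  true,  true  },
	/* KEYUPD */   { true,  true,  true,  true  },
};

static bool
TupleLockConflicts(TupleLockMode a, TupleLockMode b)
{
	return tuplock_conflicts[a][b];
}
```

Exposing FOR KEY UPDATE in the grammar lets the documentation present this
whole matrix in terms of locks a user can actually take.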

> The other user-visible pending item is that it was said that instead of
> simply using "columns used by unique indexes" as the key columns 
> considered by this patch, we should do some ALTER TABLE command.  This 
> would be a comparatively trivial undertaking, I think, but I would like
> there to be consensus on that this is really the way to go before I
> implement it.

As I mentioned in my last review, the base heuristic should be to select
attributes actually referenced by some foreign key constraint.  We don't gain
enough by automatically involving columns of indexes that could, but do not,
support an actual FK constraint.

I see value in that ALTER TABLE proposal, as a follow-on patch, for the
benefit of user-defined referential integrity constraints.  Users ought to get
the benefit of the core patch automatically, not just when they know to mark
every key column.  For that reason, the key columns defined through ALTER
TABLE should add to, not replace, the automatically-identified set.
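To make the add-not-replace semantics concrete, here is a toy sketch using a
plain bitmask in place of a real Bitmapset (names hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for a Bitmapset of attribute numbers. */
typedef uint64_t AttrSet;

/*
 * The effective set of key columns: attributes actually referenced by some
 * FK constraint, unioned with any columns the user declared via the
 * proposed ALTER TABLE command.  The user declaration extends, and never
 * shrinks, the automatically-identified set.
 */
static AttrSet
effective_key_columns(AttrSet fk_referenced, AttrSet user_declared)
{
	return fk_referenced | user_declared;
}
```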

> The new code mostly works fine but I'm pretty sure there must be serious
> bugs somewhere.  Also, in places, heap_update and heap_lock_tuple have
> become spaguettish, though I'm not sure I see better ways to write them.

Agreed, but I'm also short on ideas for rapidly and significantly improving
the situation.  If I were going to try something, I'd try splitting out the
lock-acquisition loop into a separate function, preferably shared between
heap_update() and heap_lock_tuple().


Starting with source clusters of the catversion introducing this feature,
pg_upgrade must copy pg_multixact to the new cluster.  The patch does not add
code for this.

This patch adds no user documentation.  We need to document FOR KEY SHARE.  I
don't see any current documentation of the tuple lock consequences of foreign
key constraints, so we don't need any changes there.

Should we add a few (2-6) unused flag bits to each multixact member to provide
growing room for future needs?
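Concretely, I'm imagining something like the following member layout; this is
purely a sketch of my suggestion (the struct, macros, and bit counts are
hypothetical, not from the patch):

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;	/* stand-in for the real typedef */

/* Pack the status into a few low bits; zero-fill the rest as growing room. */
#define MXACT_MEMBER_STATUS_BITS	3
#define MXACT_MEMBER_STATUS_MASK	((1 << MXACT_MEMBER_STATUS_BITS) - 1)

typedef struct MultiXactMemberDisk
{
	TransactionId	xid;
	uint8_t			flags;	/* low 3 bits: status; high 5 bits: reserved */
} MultiXactMemberDisk;

static uint8_t
member_get_status(const MultiXactMemberDisk *m)
{
	return m->flags & MXACT_MEMBER_STATUS_MASK;
}

static void
member_set_status(MultiXactMemberDisk *m, uint8_t status)
{
	m->flags = (uint8_t) ((m->flags & ~MXACT_MEMBER_STATUS_MASK) |
						  (status & MXACT_MEMBER_STATUS_MASK));
}
```

Since we're committing to this on-disk format indefinitely, a few reserved
bits per member seem cheap insurance.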

Does this patch have any special implications for REPEATABLE READ?

> --- a/contrib/pgrowlocks/pgrowlocks.c
> +++ b/contrib/pgrowlocks/pgrowlocks.c

I used pgrowlocks to play with this patch, and the output clarity seems to
have fallen somewhat:

**** w/ patch
-- SELECT * FROM test_rowlock FOR KEY SHARE;
 locked_row | lock_type | locker | multi |  xids   | modes |  pids   
------------+-----------+--------+-------+---------+-------+---------
 (0,1)      | KeyShare  |  15371 | f     | {15371} |       | {27276}
-- SELECT * FROM test_rowlock FOR SHARE;
 locked_row |  lock_type   | locker | multi |  xids   | modes |  pids   
------------+--------------+--------+-------+---------+-------+---------
 (0,1)      | IsNotUpdate  |     70 | t     | {15372} | {shr} | {27276}
-- SELECT * FROM test_rowlock FOR UPDATE;
 locked_row |       lock_type        | locker | multi |  xids   |  modes   |  pids   
------------+------------------------+--------+-------+---------+----------+---------
 (0,1)      | Exclusive IsNotUpdate  |     71 | t     | {15373} | {forupd} | {27276}
-- UPDATE test_rowlock SET non_key_col = 11;
 locked_row | lock_type | locker | multi |  xids   | modes |  pids   
------------+-----------+--------+-------+---------+-------+---------
 (0,1)      |           |  15374 | f     | {15374} |       | {27276}
-- UPDATE test_rowlock SET key_col = 2;
 locked_row | lock_type | locker | multi |  xids   | modes |  pids   
------------+-----------+--------+-------+---------+-------+---------
 (0,1)      |           |  15375 | f     | {15375} |       | {27276}

**** 9.1.1
-- SELECT * FROM test_rowlock FOR SHARE;
 locked_row | lock_type | locker | multi | xids  |  pids   
------------+-----------+--------+-------+-------+---------
 (0,1)      | Shared    |    757 | f     | {757} | {27349}
-- SELECT * FROM test_rowlock FOR UPDATE;
 locked_row | lock_type | locker | multi | xids  |  pids   
------------+-----------+--------+-------+-------+---------
 (0,1)      | Exclusive |    758 | f     | {758} | {27349}
-- UPDATE test_rowlock SET non_key_col = 11;
 locked_row | lock_type | locker | multi | xids  |  pids   
------------+-----------+--------+-------+-------+---------
 (0,1)      | Exclusive |    759 | f     | {759} | {27349}
-- UPDATE test_rowlock SET key_col = 2;
 locked_row | lock_type | locker | multi | xids  |  pids   
------------+-----------+--------+-------+-------+---------
 (0,1)      | Exclusive |    760 | f     | {760} | {27349}

I've attached fklock-pgrowlocks.sql, used to produce the above results.  In
particular, the absence of any distinction between the key_col and non_key_col
update scenarios is suboptimal.  Also, the "SELECT * FROM test_rowlock FOR
UPDATE" ought not to use a multi, right?  Choices such as letting lock_type be
blank or contain IsNotUpdate tend to make the output reflect implementation
details more than user-relevant lock semantics.  One could go either way on
those.  I tend to think pageinspect is for raw implementation exposure and
pgrowlocks for something more cooked.

I have not reviewed your actual pgrowlocks code changes.

> --- a/src/backend/access/heap/heapam.c
> +++ b/src/backend/access/heap/heapam.c

heap_xlog_update() still sets xmax of the old tuple and xmin of the new tuple
based on the XLogRecord, so it will always be the plain xid of the updater.  I
suppose that's still fine, because locks don't matter after a crash.  I
suggest adding a comment, though.

> @@ -1620,7 +1622,7 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
>  				   ItemPointerGetBlockNumber(tid));
>  			offnum = ItemPointerGetOffsetNumber(&heapTuple->t_data->t_ctid);
>  			at_chain_start = false;
> -			prev_xmax = HeapTupleHeaderGetXmax(heapTuple->t_data);
> +			prev_xmax = HeapTupleHeaderGetUpdateXid(heapTuple->t_data);
>  		}
>  		else
>  			break;				/* end of chain */

The HOT search logic in pruneheap.c needs the same change.

> @@ -1743,7 +1745,7 @@ heap_get_latest_tid(Relation relation,
>  		 * tuple.  Check for XMIN match.
>  		 */
>  		if (TransactionIdIsValid(priorXmax) &&
> -		  !TransactionIdEquals(priorXmax, HeapTupleHeaderGetXmin(tp.t_data)))
> +			!TransactionIdEquals(priorXmax, HeapTupleHeaderGetXmin(tp.t_data)))

pgindent will undo this change.

> @@ -2174,20 +2178,22 @@ l1:
>  		 */
>  		if (!have_tuple_lock)
>  		{
> -			LockTuple(relation, &(tp.t_self), ExclusiveLock);
> +			LockTuple(relation, &(tp.t_self),
> +					  get_lockmode_for_tuplelock(LockTupleKeyUpdate));

I suggest hiding calls to get_lockmode_for_tuplelock() behind a local macro
wrapper around LockTuple(), itself taking a LockTupleMode directly.
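Something along these lines is what I mean; the stubs below stand in for the
real PostgreSQL definitions, and the mapping values are invented for
illustration:

```c
#include <assert.h>

typedef int LOCKMODE;
typedef int LockTupleMode;

#define LockTupleKeyShare	0
#define LockTupleKeyUpdate	3

static LOCKMODE last_lockmode_taken = -1;

/* Hypothetical tuple-lock-mode to heavyweight-lock-mode mapping. */
static LOCKMODE
get_lockmode_for_tuplelock(LockTupleMode mode)
{
	static const LOCKMODE map[] = {1, 2, 3, 4};

	return map[mode];
}

/* Stub recording which heavyweight mode was requested. */
static void
LockTuple_stub(void *relation, void *tid, LOCKMODE lockmode)
{
	last_lockmode_taken = lockmode;
}

/*
 * The suggested wrapper: callers pass a LockTupleMode directly and never
 * see get_lockmode_for_tuplelock().
 */
#define LockTupleTuplock(rel, tid, mode) \
	LockTuple_stub((rel), (tid), get_lockmode_for_tuplelock(mode))
```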

> @@ -2471,8 +2483,14 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
>  	bool		have_tuple_lock = false;
>  	bool		iscombo;
>  	bool		use_hot_update = false;
> +	bool		key_intact;
>  	bool		all_visible_cleared = false;
>  	bool		all_visible_cleared_new = false;
> +	bool			keep_xmax_multi = false;
> +	TransactionId	keep_xmax = InvalidTransactionId;

Those two new variables should be initialized after `l2'.  If the final goto
fires (we lack a needed pin on the visibility map page), their existing values
become invalid.  (Not positive there's a live bug here, but it's fragile.)

> +	TransactionId	keep_xmax_old = InvalidTransactionId;
> +	uint16		keep_xmax_infomask = 0;
> +	uint16		keep_xmax_old_infomask = 0;
>  
>  	Assert(ItemPointerIsValid(otid));
>  
> @@ -2488,7 +2506,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
>  	 * Note that we get a copy here, so we need not worry about relcache flush
>  	 * happening midway through.
>  	 */
> -	hot_attrs = RelationGetIndexAttrBitmap(relation);
> +	hot_attrs = RelationGetIndexAttrBitmap(relation, false);
> +	key_attrs = RelationGetIndexAttrBitmap(relation, true);
>  
>  	block = ItemPointerGetBlockNumber(otid);
>  	buffer = ReadBuffer(relation, block);
> @@ -2513,6 +2532,24 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
>  	oldtup.t_self = *otid;
>  
>  	/*
> +	 * If we're not updating any "key" column, we can grab a milder lock type.
> +	 * This allows for more concurrency when we are running simultaneously with
> +	 * foreign key checks.
> +	 */
> +	if (HeapSatisfiesHOTUpdate(relation, key_attrs, &oldtup, newtup))

This will only pass when the toastedness matches, too.  That is, something
like "UPDATE t SET keycol = keycol || ''" will take the stronger lock if
keycol is toasted.  That's probably for the best -- the important case to
optimize is updates that never manipulate key columns at all, not those that
serendipitously arrive at the same key values.  I wouldn't expect a net win
from the toaster effort needed to recognize the latter case.

Nonetheless, a comment here should note the decision.

> +	{
> +		tuplock = LockTupleUpdate;
> +		mxact_status = MultiXactStatusUpdate;
> +		key_intact = true;
> +	}
> +	else
> +	{
> +		tuplock = LockTupleKeyUpdate;
> +		mxact_status = MultiXactStatusKeyUpdate;
> +		key_intact = false;
> +	}
> +
> +	/*
>  	 * Note: beyond this point, use oldtup not otid to refer to old tuple.
>  	 * otid may very well point at newtup->t_self, which we will overwrite
>  	 * with the new tuple's location, so there's great risk of confusion if we
> @@ -2522,6 +2559,9 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
>  l2:
>  	result = HeapTupleSatisfiesUpdate(oldtup.t_data, cid, buffer);
>  
> +	/* see below about the "no wait" case */
> +	Assert(result != HeapTupleBeingUpdated || wait);

Maybe just Assert(wait)?  Given that we're breaking that usage, no point in
giving the user any illusions.  However ...

> +
>  	if (result == HeapTupleInvisible)
>  	{
>  		UnlockReleaseBuffer(buffer);
> @@ -2529,8 +2569,21 @@ l2:
>  	}
>  	else if (result == HeapTupleBeingUpdated && wait)
>  	{
> -		TransactionId xwait;
> +		TransactionId	xwait;

pgindent will undo this change.

>  		uint16		infomask;
> +		bool		none_remain = false;
> +
> +		/*
> +		 * XXX note that we don't consider the "no wait" case here.  This
> +		 * isn't a problem currently because no caller uses that case, but it
> +		 * should be fixed if such a caller is introduced.  It wasn't a problem
> +		 * previously because this code would always wait, but now that some
> +		 * tuple locks do not conflict with one of the lock modes we use, it is
> +		 * possible that this case is interesting to handle specially.
> +		 *
> +		 * This may cause failures with third-party code that calls heap_update
> +		 * directly.
> +		 */

... consider that this introduces code drift between heap_update() and the
presently-similar logic in heap_lock_tuple().

>  
>  		/* must copy state data before unlocking buffer */
>  		xwait = HeapTupleHeaderGetXmax(oldtup.t_data);
> @@ -2549,20 +2602,26 @@ l2:
>  		 */
>  		if (!have_tuple_lock)
>  		{
> -			LockTuple(relation, &(oldtup.t_self), ExclusiveLock);
> +			LockTuple(relation, &(oldtup.t_self),
> +					  get_lockmode_for_tuplelock(tuplock));
>  			have_tuple_lock = true;
>  		}
>  
>  		/*
> -		 * Sleep until concurrent transaction ends.  Note that we don't care
> -		 * if the locker has an exclusive or shared lock, because we need
> -		 * exclusive.
> +		 * Now sleep on the locker.  Note that if there are only key-share
> +		 * lockers and we're not updating the key columns, we will be awaken
> +		 * before it is gone, so we may need to mark the new tuple with a
> +		 * new MultiXactId including the original xmax and ourselves.

Well, we'll never actually sleep at all.

> +		 *
> +		 * XXX this comment needs to be more comprehensive
>  		 */
> -
>  		if (infomask & HEAP_XMAX_IS_MULTI)
>  		{
> +			TransactionId	update_xact;
> +			int				remain;
> +
>  			/* wait for multixact */
> -			MultiXactIdWait((MultiXactId) xwait);
> +			MultiXactIdWait((MultiXactId) xwait, mxact_status, &remain);
>  			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
>  
>  			/*
> @@ -2576,41 +2635,98 @@ l2:
>  				goto l2;
>  
>  			/*
> -			 * You might think the multixact is necessarily done here, but not
> -			 * so: it could have surviving members, namely our own xact or
> -			 * other subxacts of this backend.	It is legal for us to update
> -			 * the tuple in either case, however (the latter case is
> -			 * essentially a situation of upgrading our former shared lock to
> -			 * exclusive).	We don't bother changing the on-disk hint bits
> -			 * since we are about to overwrite the xmax altogether.
> +			 * Note that the multixact may not be done by now.  It could have
> +			 * surviving members; our own xact or other subxacts of this
> +			 * backend, and also any other concurrent transaction that locked
> +			 * the tuple with KeyShare if we only got TupleLockUpdate.  If this
> +			 * is the case, we have to be careful to mark the updated tuple
> +			 * with the surviving members in Xmax.
> +			 *
> +			 * Note that there could have been another update in the MultiXact.
> +			 * In that case, we need to check whether it committed or aborted.
> +			 * If it aborted we are safe to update it again; otherwise there is
> +			 * an update conflict that must be handled below.

It's handled below in the sense that we bail, returning HeapTupleUpdated?

> +			 *
> +			 * In the LockTupleKeyUpdate case, we still need to preserve the
> +			 * surviving members: those would include the tuple locks we had
> +			 * before this one, which are important to keep in case this
> +			 * subxact aborts.
>  			 */
> +			update_xact = InvalidTransactionId;
> +			if (!(oldtup.t_data->t_infomask & HEAP_XMAX_IS_NOT_UPDATE))
> +				update_xact = HeapTupleGetUpdateXid(oldtup.t_data);
> +
> +			/* there was no UPDATE in the MultiXact; or it aborted. */
> +			if (update_xact == InvalidTransactionId ||
> +				TransactionIdDidAbort(update_xact))
> +			{
> +				/*
> +				 * if the multixact still has live members, we need to preserve
> +				 * it by creating a new multixact.  If all members are gone, we
> +				 * can simply update the tuple by setting ourselves in Xmax.
> +				 */
> +				if (remain > 0)
> +				{
> +					keep_xmax = HeapTupleHeaderGetXmax(oldtup.t_data);
> +					keep_xmax_multi =
> +						(oldtup.t_data->t_infomask & HEAP_XMAX_IS_MULTI) != 0;

Will keep_xmax_multi ever be false?  Would we not have exited at the above
"goto l2;" in those cases?

> +				}
> +				else
> +				{
> +					/*
> +					 * We could set the HEAP_XMAX_INVALID bit here instead of
> +					 * using a separate boolean flag.  However, since we're going
> +					 * to set up a new xmax below, this would waste time
> +					 * setting up the buffer's dirty bit.
> +					 */
> +					none_remain = false;
> +				}
> +			}
>  		}
>  		else

This would require less reindentation as an "else if", rather than "else { if".

>  		{
> -			/* wait for regular transaction to end */
> -			XactLockTableWait(xwait);
> -			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
> -
>  			/*
> -			 * xwait is done, but if xwait had just locked the tuple then some
> -			 * other xact could update this tuple before we get to this point.
> -			 * Check for xmax change, and start over if so.
> +			 * If it's just a key-share locker, and we're not changing the
> +			 * key columns, we don't need to wait for it to wait; but we
> +			 * need to preserve it as locker.
>  			 */
> -			if ((oldtup.t_data->t_infomask & HEAP_XMAX_IS_MULTI) ||
> -				!TransactionIdEquals(HeapTupleHeaderGetXmax(oldtup.t_data),
> -									 xwait))
> -				goto l2;
> +			if ((oldtup.t_data->t_infomask & HEAP_XMAX_KEYSHR_LOCK) &&
> +				key_intact)

You don't have a content lock on the buffer at this point, so the test should
be against "infomask", not "oldtup.t_data->t_infomask".

> +			{
> +				LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
> +				keep_xmax = xwait;
> +				keep_xmax_multi = false;
> +			}

Like the other branches, this one needs to recheck the t_infomask after
reacquiring the content lock.

It would be nice to completely avoid releasing the content lock in cases that
don't involve any waiting.  However, since that (have_tuple_lock = true) is
already something of a slow path, I doubt it's worth the complexity.
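The invariant every branch should preserve is: snapshot xmax and infomask
under the content lock, drop the lock to wait, then after reacquiring the
lock verify that nothing changed, else restart.  A toy sketch of just the
recheck step (types and names invented):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

typedef struct
{
	TransactionId	xmax;
	uint16_t		infomask;
} FakeTupleHeader;

/*
 * Called with the content lock reacquired.  xwait/infomask are the values
 * snapshotted before the lock was released; if either moved under us, the
 * caller must goto l2 (or l3) and redo visibility checks from scratch.
 */
static bool
xmax_still_matches(const FakeTupleHeader *tup,
				   TransactionId xwait, uint16_t infomask)
{
	return tup->xmax == xwait && tup->infomask == infomask;
}
```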

> +			else
> +			{
> +				/* wait for regular transaction to end */
> +				XactLockTableWait(xwait);
> +				LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
>  
> -			/* Otherwise check if it committed or aborted */
> -			UpdateXmaxHintBits(oldtup.t_data, buffer, xwait);
> +				/*
> +				 * xwait is done, but if xwait had just locked the tuple then some
> +				 * other xact could update this tuple before we get to this point.
> +				 * Check for xmax change, and start over if so.
> +				 */
> +				if ((oldtup.t_data->t_infomask & HEAP_XMAX_IS_MULTI) ||
> +					!TransactionIdEquals(HeapTupleHeaderGetXmax(oldtup.t_data),
> +										 xwait))
> +					goto l2;
> +
> +				/* Otherwise check if it committed or aborted */
> +				UpdateXmaxHintBits(oldtup.t_data, buffer, xwait);
> +			}
>  		}
>  
>  		/*
>  		 * We may overwrite if previous xmax aborted, or if it committed but
> -		 * only locked the tuple without updating it.
> +		 * only locked the tuple without updating it, or if we are going to
> +		 * keep it around in Xmax.
>  		 */
> -		if (oldtup.t_data->t_infomask & (HEAP_XMAX_INVALID |
> -										 HEAP_IS_LOCKED))
> +		if (TransactionIdIsValid(keep_xmax) ||
> +			none_remain ||
> +			(oldtup.t_data->t_infomask & HEAP_XMAX_INVALID) ||
> +			HeapTupleHeaderIsLocked(oldtup.t_data))

When is the HeapTupleHeaderIsLocked(oldtup.t_data) condition needed?  Offhand,
I'd think none_remain = true and HEAP_XMAX_INVALID cover its cases.

>  			result = HeapTupleMayBeUpdated;
>  		else
>  			result = HeapTupleUpdated;
> @@ -2630,13 +2746,15 @@ l2:
>  			   result == HeapTupleBeingUpdated);
>  		Assert(!(oldtup.t_data->t_infomask & HEAP_XMAX_INVALID));
>  		*ctid = oldtup.t_data->t_ctid;
> -		*update_xmax = HeapTupleHeaderGetXmax(oldtup.t_data);
> +		*update_xmax = HeapTupleHeaderGetUpdateXid(oldtup.t_data);
>  		UnlockReleaseBuffer(buffer);
>  		if (have_tuple_lock)
> -			UnlockTuple(relation, &(oldtup.t_self), ExclusiveLock);
> +			UnlockTuple(relation, &(oldtup.t_self),
> +						get_lockmode_for_tuplelock(tuplock));
>  		if (vmbuffer != InvalidBuffer)
>  			ReleaseBuffer(vmbuffer);
>  		bms_free(hot_attrs);
> +		bms_free(key_attrs);
>  		return result;
>  	}
>  
> @@ -2645,7 +2763,7 @@ l2:
>  	 * visible while we were busy locking the buffer, or during some subsequent
>  	 * window during which we had it unlocked, we'll have to unlock and
>  	 * re-lock, to avoid holding the buffer lock across an I/O.  That's a bit
> -	 * unfortunate, esepecially since we'll now have to recheck whether the
> +	 * unfortunate, especially since we'll now have to recheck whether the
>  	 * tuple has been locked or updated under us, but hopefully it won't
>  	 * happen very often.
>  	 */
> @@ -2678,13 +2796,54 @@ l2:
>  		Assert(!(newtup->t_data->t_infomask & HEAP_HASOID));
>  	}
>  
> +	/*
> +	 * If the tuple we're updating is locked, we need to preserve this in the
> +	 * new tuple's Xmax as well as in the old tuple.  Prepare the new xmax
> +	 * value for these uses.
> +	 *
> +	 * Note there cannot be an xmax to save if we're changing key columns; in
> +	 * this case, the wait above should have only returned when the locking
> +	 * transactions finished.
> +	 */
> +	if (TransactionIdIsValid(keep_xmax))
> +	{
> +		if (keep_xmax_multi)
> +		{
> +			keep_xmax_old = MultiXactIdExpand(keep_xmax,
> +											  xid, MultiXactStatusUpdate);
> +			keep_xmax_infomask = HEAP_XMAX_KEYSHR_LOCK | HEAP_XMAX_IS_MULTI;

Not directly related to this line, but is the HEAP_XMAX_IS_NOT_UPDATE bit
getting cleared where needed?

> +		}
> +		else
> +		{
> +			/* not a multi? must be a KEY SHARE locker */
> +			keep_xmax_old = MultiXactIdCreate(keep_xmax, MultiXactStatusForKeyShare,
> +											  xid, MultiXactStatusUpdate);
> +			keep_xmax_infomask = HEAP_XMAX_KEYSHR_LOCK;
> +		}
> +		keep_xmax_old_infomask = HEAP_XMAX_IS_MULTI | HEAP_XMAX_KEYSHR_LOCK;
> +		/* FIXME -- need more infomask bits? */

Maybe ... I haven't thought it all through.

> +	}
> +
> +	/*
> +	 * Prepare the new tuple with the appropriate initial values of Xmin and
> +	 * Xmax, as well as initial infomask bits.
> +	 */
>  	newtup->t_data->t_infomask &= ~(HEAP_XACT_MASK);
>  	newtup->t_data->t_infomask2 &= ~(HEAP2_XACT_MASK);
> -	newtup->t_data->t_infomask |= (HEAP_XMAX_INVALID | HEAP_UPDATED);
> +	newtup->t_data->t_infomask |= HEAP_UPDATED;
>  	HeapTupleHeaderSetXmin(newtup->t_data, xid);
>  	HeapTupleHeaderSetCmin(newtup->t_data, cid);
> -	HeapTupleHeaderSetXmax(newtup->t_data, 0);	/* for cleanliness */
>  	newtup->t_tableOid = RelationGetRelid(relation);
> +	if (TransactionIdIsValid(keep_xmax))
> +	{
> +		newtup->t_data->t_infomask |= keep_xmax_infomask;
> +		HeapTupleHeaderSetXmax(newtup->t_data, keep_xmax);
> +	}
> +	else
> +	{
> +		newtup->t_data->t_infomask |= HEAP_XMAX_INVALID;
> +		HeapTupleHeaderSetXmax(newtup->t_data, 0);	/* for cleanliness */
> +	}
>  
>  	/*
>  	 * Replace cid with a combo cid if necessary.  Note that we already put
> @@ -2725,11 +2884,20 @@ l2:
>  		oldtup.t_data->t_infomask &= ~(HEAP_XMAX_COMMITTED |
>  									   HEAP_XMAX_INVALID |
>  									   HEAP_XMAX_IS_MULTI |
> -									   HEAP_IS_LOCKED |
> +									   HEAP_LOCK_BITS |
>  									   HEAP_MOVED);
> +		oldtup.t_data->t_infomask2 &= ~HEAP_UPDATE_KEY_INTACT;
>  		HeapTupleClearHotUpdated(&oldtup);
>  		/* ... and store info about transaction updating this tuple */
> -		HeapTupleHeaderSetXmax(oldtup.t_data, xid);
> +		if (TransactionIdIsValid(keep_xmax_old))
> +		{
> +			HeapTupleHeaderSetXmax(oldtup.t_data, keep_xmax_old);
> +			oldtup.t_data->t_infomask |= keep_xmax_old_infomask;
> +		}
> +		else
> +			HeapTupleHeaderSetXmax(oldtup.t_data, xid);
> +		if (key_intact)
> +			oldtup.t_data->t_infomask2 |= HEAP_UPDATE_KEY_INTACT;
>  		HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo);
>  		/* temporarily make it look not-updated */
>  		oldtup.t_data->t_ctid = oldtup.t_self;

Shortly after this, we release the content lock and go off toasting the tuple
and finding free space.  When we come back, could the old tuple have
accumulated additional KEY SHARE locks that we need to re-copy?

> @@ -2883,10 +3051,19 @@ l2:
>  		oldtup.t_data->t_infomask &= ~(HEAP_XMAX_COMMITTED |
>  									   HEAP_XMAX_INVALID |
>  									   HEAP_XMAX_IS_MULTI |
> -									   HEAP_IS_LOCKED |
> +									   HEAP_LOCK_BITS |
>  									   HEAP_MOVED);
> +		oldtup.t_data->t_infomask2 &= ~HEAP_UPDATE_KEY_INTACT;
>  		/* ... and store info about transaction updating this tuple */
> -		HeapTupleHeaderSetXmax(oldtup.t_data, xid);
> +		if (TransactionIdIsValid(keep_xmax_old))
> +		{
> +			HeapTupleHeaderSetXmax(oldtup.t_data, keep_xmax_old);
> +			oldtup.t_data->t_infomask |= keep_xmax_old_infomask;
> +		}
> +		else
> +			HeapTupleHeaderSetXmax(oldtup.t_data, xid);
> +		if (key_intact)
> +			oldtup.t_data->t_infomask2 |= HEAP_UPDATE_KEY_INTACT;
>  		HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo);
>  	}

> @@ -3201,13 +3429,13 @@ heap_lock_tuple(Relation relation, HeapTuple tuple, Buffer *buffer,
>  	Page		page;
>  	TransactionId xid;
>  	TransactionId xmax;
> +	TransactionId keep_xmax = InvalidTransactionId;
> +	bool		keep_xmax_multi = false;
> +	bool		none_remains = false;
>  	uint16		old_infomask;
>  	uint16		new_infomask;
> -	LOCKMODE	tuple_lock_type;
>  	bool		have_tuple_lock = false;
>  
> -	tuple_lock_type = (mode == LockTupleShared) ? ShareLock : ExclusiveLock;
> -
>  	*buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
>  	LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
>  
> @@ -3220,6 +3448,9 @@ heap_lock_tuple(Relation relation, HeapTuple tuple, Buffer *buffer,
>  	tuple->t_tableOid = RelationGetRelid(relation);
>  
>  l3:
> +	/* shouldn't get back here if we already set keep_xmax */
> +	Assert(keep_xmax == InvalidTransactionId);
> +
>  	result = HeapTupleSatisfiesUpdate(tuple->t_data, cid, *buffer);
>  
>  	if (result == HeapTupleInvisible)
> @@ -3231,30 +3462,70 @@ l3:
>  	{
>  		TransactionId xwait;
>  		uint16		infomask;
> +		uint16		infomask2;
> +		bool		require_sleep;
>  
>  		/* must copy state data before unlocking buffer */
>  		xwait = HeapTupleHeaderGetXmax(tuple->t_data);
>  		infomask = tuple->t_data->t_infomask;
> +		infomask2 = tuple->t_data->t_infomask2;
>  
>  		LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
>  
>  		/*
> -		 * If we wish to acquire share lock, and the tuple is already
> -		 * share-locked by a multixact that includes any subtransaction of the
> -		 * current top transaction, then we effectively hold the desired lock
> -		 * already.  We *must* succeed without trying to take the tuple lock,
> -		 * else we will deadlock against anyone waiting to acquire exclusive
> -		 * lock.  We don't need to make any state changes in this case.
> +		 * If we wish to acquire share or key lock, and the tuple is already
> +		 * key or share locked by a multixact that includes any subtransaction
> +		 * of the current top transaction, then we effectively hold the desired
> +		 * lock already (except if we own key share lock and now desire share
> +		 * lock).  We *must* succeed without trying to take the tuple lock,

This can now apply to FOR UPDATE as well.

For the first sentence, I suggest the wording "If any subtransaction of the
current top transaction already holds a stronger lock, we effectively hold the
desired lock already."

> +		 * else we will deadlock against anyone wanting to acquire a stronger
> +		 * lock.

> +		 *
> +		 * FIXME -- we don't do the below currently, but I think we should:
> +		 *
> +		 * We update the Xmax with a new MultiXactId to include the new lock
> +		 * mode in this case.
> +		 *
> +		 * Note that since we want to alter the Xmax, we need to re-acquire the
> +		 * buffer lock.  The xmax could have changed in the meantime, so we
> +		 * recheck it in that case, but we keep the buffer lock while doing it
> +		 * to prevent starvation.  The second time around we know we must be
> +		 * part of the MultiXactId in any case, which is why we don't need to
> +		 * go back to recheck HeapTupleSatisfiesUpdate.  Also, after we
> +		 * re-acquire lock, the MultiXact is likely to (but not necessarily) be
> +		 * the same that we see here, so it should be in multixact's cache and
> +		 * thus quick to obtain.

What is the benefit of doing so?

>  		 */
> -		if (mode == LockTupleShared &&
> -			(infomask & HEAP_XMAX_IS_MULTI) &&
> -			MultiXactIdIsCurrent((MultiXactId) xwait))
> +		if ((infomask & HEAP_XMAX_IS_MULTI) &&
> +			((mode == LockTupleShare) || (mode == LockTupleKeyShare)))
>  		{
> -			Assert(infomask & HEAP_XMAX_SHARED_LOCK);
> -			/* Probably can't hold tuple lock here, but may as well check */
> -			if (have_tuple_lock)
> -				UnlockTuple(relation, tid, tuple_lock_type);
> -			return HeapTupleMayBeUpdated;
> +			int		i;
> +			int		nmembers;
> +			MultiXactMember *members;
> +
> +			nmembers = GetMultiXactIdMembers(xwait, &members);
> +
> +			for (i = 0; i < nmembers; i++)
> +			{
> +				if (TransactionIdIsCurrentTransactionId(members[i].xid))

This does not handle subtransactions like the previous code.

I have not yet reviewed the rest of the heap_lock_tuple() changes.

> @@ -3789,6 +4305,8 @@ recheck_xmax:
>  		 * extremely low-probability scenario with minimal downside even if
>  		 * it does happen, so for now we don't do the extra bookkeeping that
>  		 * would be needed to clean out MultiXactIds.
> +		 *
> +		 * FIXME -- today is that day.  Figure this out.

Yep.  I think you can just use HeapTupleHeaderGetUpdateXid() and remove the
explicit conditional on HEAP_XMAX_IS_MULTI.

> @@ -3919,6 +4536,7 @@ HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
>  									   TransactionId *latestRemovedXid)
>  {
>  	TransactionId xmin = HeapTupleHeaderGetXmin(tuple);
> +	/* FIXME -- change this? */
>  	TransactionId xmax = HeapTupleHeaderGetXmax(tuple);

Yes.  Since this function is only passed dead tuples, it could previously
expect to never see a multixact xmax.  No longer.

> @@ -4991,14 +5609,18 @@ heap_xlog_lock(XLogRecPtr lsn, XLogRecord *record)
>  	htup->t_infomask &= ~(HEAP_XMAX_COMMITTED |
>  						  HEAP_XMAX_INVALID |
>  						  HEAP_XMAX_IS_MULTI |
> -						  HEAP_IS_LOCKED |
> +						  HEAP_LOCK_BITS |
>  						  HEAP_MOVED);
> -	if (xlrec->xid_is_mxact)
> +	if (xlrec->infobits_set & XLHL_XMAX_IS_MULTI)
>  		htup->t_infomask |= HEAP_XMAX_IS_MULTI;
> -	if (xlrec->shared_lock)
> -		htup->t_infomask |= HEAP_XMAX_SHARED_LOCK;
> -	else
> +	if (xlrec->infobits_set & XLHL_XMAX_IS_NOT_UPDATE)
> +		htup->t_infomask |= HEAP_XMAX_IS_NOT_UPDATE;
> +	if (xlrec->infobits_set & XLHL_XMAX_EXCL_LOCK)
>  		htup->t_infomask |= HEAP_XMAX_EXCL_LOCK;
> +	if (xlrec->infobits_set & XLHL_XMAX_KEYSHR_LOCK)
> +		htup->t_infomask |= HEAP_XMAX_KEYSHR_LOCK;
> +	if (xlrec->infobits_set & XLHL_UPDATE_KEY_INTACT)
> +		htup->t_infomask2 |= HEAP_UPDATE_KEY_INTACT;
>  	HeapTupleHeaderClearHotUpdated(htup);
>  	HeapTupleHeaderSetXmax(htup, xlrec->locking_xid);
>  	HeapTupleHeaderSetCmax(htup, FirstCommandId, false);

Just after here is this code:

	/* Make sure there is no forward chain link in t_ctid */
	htup->t_ctid = xlrec->target.tid;

Now that a KEY SHARE lock can be applied on top of an UPDATE, that's no longer
always valid.

Incidentally, why is this level of xlog detail needed for tuple locks?  We
need an FPI of the page before the lock-related changes start scribbling on
it, and we need to log any xid, even that of a locker, that could land in the
heap on disk.  But, why do we actually need to replay each lock?

> --- a/src/backend/access/transam/multixact.c
> +++ b/src/backend/access/transam/multixact.c
> @@ -4,7 +4,7 @@
>   *		PostgreSQL multi-transaction-log manager
>   *
>   * The pg_multixact manager is a pg_clog-like manager that stores an array
> - * of TransactionIds for each MultiXactId.	It is a fundamental part of the
> + * of MultiXactMember for each MultiXactId.	It is a fundamental part of the
>   * shared-row-lock implementation.	A share-locked tuple stores a
>   * MultiXactId in its Xmax, and a transaction that needs to wait for the
>   * tuple to be unlocked can sleep on the potentially-several TransactionIds

This header comment (including more than the portion quoted here) needs
further updates.  In particular, there's no direct reference to the flag bits
now stored with each member xid.  Also, the comment mentions merely
preserving state across crashes, but this data now has the pg_clog life cycle.
Consider mentioning that the name is a bit historical: a singleton multixact
now has value for storing flags that have no other means of expression.

> @@ -48,6 +48,8 @@
>   */
>  #include "postgres.h"
>  
> +#include <unistd.h>
> +
>  #include "access/multixact.h"
>  #include "access/slru.h"
>  #include "access/transam.h"
> @@ -60,6 +62,7 @@
>  #include "storage/procarray.h"
>  #include "utils/builtins.h"
>  #include "utils/memutils.h"
> +#include "utils/snapmgr.h"
>  
>  
>  /*
> @@ -75,19 +78,58 @@
>   * (see MultiXact{Offset,Member}PagePrecedes).
>   */

The comment just ending here mentions "MULTIXACT_*_PER_PAGE", but it's now
only correct for MULTIXACT_OFFSETS_PER_PAGE.

> @@ -180,7 +213,8 @@ static MultiXactId *OldestVisibleMXactId;
>   * so they will be uninteresting by the time our next transaction starts.
>   * (XXX not clear that this is correct --- other members of the MultiXact
>   * could hang around longer than we did.  However, it's not clear what a
> - * better policy for flushing old cache entries would be.)
> + * better policy for flushing old cache entries would be.)  FIXME actually
> + * this is plain wrong now that multixact's may contain update Xids.

A key role of the cache is to avoid creating vast numbers of multixacts each
having the same membership.  In that role, the existing policy seems no less
suitable than before.  I agree that this patch makes the policy less suitable
for readers, though.  Not sure what should be done about that, if anything.

> @@ -235,29 +297,59 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
>  						MultiXactOffset offset2);
>  static void ExtendMultiXactOffset(MultiXactId multi);
>  static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
> -static void TruncateMultiXact(void);
> -static void WriteMZeroPageXlogRec(int pageno, uint8 info);
> +static void fillSegmentInfoData(SlruCtl ctl, SegmentInfo *segment);
> +static int	compareTruncateXidEpoch(const void *a, const void *b);
> +static void WriteMZeroOffsetPageXlogRec(int pageno, TransactionId truncateXid,
> +							uint32 truncateXidEpoch);
> +static void WriteMZeroMemberPageXlogRec(int pageno);
>  
>  
>  /*
> + * MultiXactIdCreateSingleton
> + * 		Construct a MultiXactId representing a single transaction.

I suggest mentioning that this is useful for marking a tuple in a manner that
can only be achieved through multixact flags.

> + *
> + * NB - we don't worry about our local MultiXactId cache here, because that
> + * is handled by the lower-level routines.
> + */
> +MultiXactId
> +MultiXactIdCreateSingleton(TransactionId xid, MultiXactStatus status)
> +{
> +	MultiXactId	newMulti;
> +	MultiXactMember	member[1];
> +
> +	AssertArg(TransactionIdIsValid(xid));
> +
> +	member[0].xid = xid;
> +	member[0].status = status;
> +
> +	newMulti = CreateMultiXactId(1, member);
> +
> +	debug_elog4(DEBUG2, "Create: returning %u for %u",
> +			   newMulti, xid);
> +
> +	return newMulti;
> +}
> +
> +/*
>   * MultiXactIdCreate
>   *		Construct a MultiXactId representing two TransactionIds.
>   *
> - * The two XIDs must be different.
> + * The two XIDs must be different, or be requesting different lock modes.

Why is it not sufficient to store the strongest type for a particular xid?

> @@ -376,7 +480,7 @@ MultiXactIdExpand(MultiXactId multi, TransactionId xid)
>  bool
>  MultiXactIdIsRunning(MultiXactId multi)
>  {
> -	TransactionId *members;
> +	MultiXactMember *members;
>  	int			nmembers;
>  	int			i;
>  
> @@ -397,7 +501,7 @@ MultiXactIdIsRunning(MultiXactId multi)
>  	 */
>  	for (i = 0; i < nmembers; i++)
>  	{
> -		if (TransactionIdIsCurrentTransactionId(members[i]))
> +		if (TransactionIdIsCurrentTransactionId(members[i].xid))
>  		{
>  			debug_elog3(DEBUG2, "IsRunning: I (%d) am running!", i);
>  			pfree(members);
> @@ -412,10 +516,10 @@ MultiXactIdIsRunning(MultiXactId multi)
>  	 */

Just before here, there's a comment referring to the now-nonexistent
MultiXactIdIsCurrent().

> @@ -576,17 +541,24 @@ MultiXactIdSetOldestVisible(void)
>   * this would not merely be useless but would lead to Assert failure inside
>   * XactLockTableWait.  By the time this returns, it is certain that all
>   * transactions *of other backends* that were members of the MultiXactId
> - * are dead (and no new ones can have been added, since it is not legal
> - * to add members to an existing MultiXactId).
> + * that conflict with the requested status are dead (and no new ones can have
> + * been added, since it is not legal to add members to an existing
> + * MultiXactId).
> + *
> + * We return the number of members that we did not test for.  This is dubbed
> + * "remaining" as in "the number of members that remaing running", but this is

Typo: "remaing".

> + * slightly incorrect, because lockers whose status did not conflict with ours
> + * are not even considered and so might have gone away anyway.
>   *
>   * But by the time we finish sleeping, someone else may have changed the Xmax
>   * of the containing tuple, so the caller needs to iterate on us somehow.
>   */
>  void
> -MultiXactIdWait(MultiXactId multi)
> +MultiXactIdWait(MultiXactId multi, MultiXactStatus status, int *remaining)

This function should probably move (with a new name) to heapam.c (or maybe
lmgr.c, in part).  It's an abstraction violation to have multixact.c knowing
about lock conflict tables.  multixact.c should be marshalling those two bits
alongside each xid without any deep knowledge of their meaning.

> @@ -663,7 +649,7 @@ CreateMultiXactId(int nxids, TransactionId *xids)
>  	xl_multixact_create xlrec;
>  
>  	debug_elog3(DEBUG2, "Create: %s",
> -				mxid_to_string(InvalidMultiXactId, nxids, xids));
> +				mxid_to_string(InvalidMultiXactId, nmembers, members));
>  
>  	/*
>  	 * See if the same set of XIDs already exists in our cache; if so, just

XIDs -> members

> @@ -870,13 +875,14 @@ GetNewMultiXactId(int nxids, MultiXactOffset *offset)
>  	 *
>  	 * We don't care about MultiXactId wraparound here; it will be handled by
>  	 * the next iteration.	But note that nextMXact may be InvalidMultiXactId
> -	 * after this routine exits, so anyone else looking at the variable must
> -	 * be prepared to deal with that.  Similarly, nextOffset may be zero, but
> -	 * we won't use that as the actual start offset of the next multixact.
> +	 * or the first value on a segment-beggining page after this routine exits,

Typo: "beggining".

> @@ -904,64 +932,61 @@ GetMultiXactIdMembers(MultiXactId multi, TransactionId **xids)
>  	int			length;
>  	int			truelength;
>  	int			i;
> +	MultiXactId oldestMXact;
>  	MultiXactId nextMXact;
>  	MultiXactId tmpMXact;
>  	MultiXactOffset nextOffset;
> -	TransactionId *ptr;
> +	MultiXactMember *ptr;
>  
>  	debug_elog3(DEBUG2, "GetMembers: asked for %u", multi);
>  
>  	Assert(MultiXactIdIsValid(multi));
>  
>  	/* See if the MultiXactId is in the local cache */
> -	length = mXactCacheGetById(multi, xids);
> +	length = mXactCacheGetById(multi, members);
>  	if (length >= 0)
>  	{
>  		debug_elog3(DEBUG2, "GetMembers: found %s in the cache",
> -					mxid_to_string(multi, length, *xids));
> +					mxid_to_string(multi, length, *members));
>  		return length;
>  	}
>  
> -	/* Set our OldestVisibleMXactId[] entry if we didn't already */
> -	MultiXactIdSetOldestVisible();
> -
>  	/*
>  	 * We check known limits on MultiXact before resorting to the SLRU area.
>  	 *
> -	 * An ID older than our OldestVisibleMXactId[] entry can't possibly still
> -	 * be running, and we'd run the risk of trying to read already-truncated
> -	 * SLRU data if we did try to examine it.
> +	 * An ID older than MultiXactState->oldestMultiXactId cannot possibly be
> +	 * useful; it should have already been frozen by vacuum.  We've truncated
> +	 * the on-disk structures anyway, so we return empty if such a value is
> +	 * queried.

Per the "XXX perhaps someday" comment in heap_freeze_tuple(), the implication
of probing for an old multixact record has heretofore been minor.  From now on,
it can mean making the wrong visibility decision.  Enter data loss.  Hence, an
elog(ERROR) is normally in order.  For the benefit of binary upgrades, we
could be permissive in the face of HEAP_XMAX_IS_NOT_UPDATE (formerly known as
HEAP_XMAX_SHARED_LOCK).

>  	 *
>  	 * Conversely, an ID >= nextMXact shouldn't ever be seen here; if it is
>  	 * seen, it implies undetected ID wraparound has occurred.	We just
>  	 * silently assume that such an ID is no longer running.

Likewise, this is now fatal.

This raises a notable formal hazard: it's possible to burn through the
MultiXactId space faster than the regular TransactionId space.  We could get
into a situation where pg_clog is covering 2B xids, and yet we need >4B
MultiXactId to cover that period.  We had better at least notice this and
halt, if not have autovacuum actively prevent it.

> @@ -1026,9 +1051,8 @@ retry:
>  	{
>  		MultiXactOffset nextMXOffset;
>  
> -		/* handle wraparound if needed */
> -		if (tmpMXact < FirstMultiXactId)
> -			tmpMXact = FirstMultiXactId;
> +		/* Handle corner cases if needed */
> +		tmpMXact = HandleMxactOffsetCornerCases(tmpMXact);

Is there a reason apart from cycle shaving to increment a MultiXactId in one
place and elsewhere fix up the incremented value to skip the special values?
Compare to just having a MultiXactIdIncrement() function.  This isn't new with
your patch, but it certainly looks odd.

> @@ -1113,26 +1170,27 @@ retry:
>   * for the majority of tuples, thus keeping MultiXactId usage low (saving
>   * both I/O and wraparound issues).
>   *
> - * NB: the passed xids[] array will be sorted in-place.
> + * NB: the passed members array will be sorted in-place.
>   */
>  static MultiXactId
> -mXactCacheGetBySet(int nxids, TransactionId *xids)
> +mXactCacheGetBySet(int nmembers, MultiXactMember *members)
>  {
>  	mXactCacheEnt *entry;
>  
>  	debug_elog3(DEBUG2, "CacheGet: looking for %s",
> -				mxid_to_string(InvalidMultiXactId, nxids, xids));
> +				mxid_to_string(InvalidMultiXactId, nmembers, members));
>  
>  	/* sort the array so comparison is easy */
> -	qsort(xids, nxids, sizeof(TransactionId), xidComparator);
> +	qsort(members, nmembers, sizeof(MultiXactMember), mxactMemberComparator);
>  
>  	for (entry = MXactCache; entry != NULL; entry = entry->next)
>  	{
> -		if (entry->nxids != nxids)
> +		if (entry->nmembers != nmembers)
>  			continue;
>  
>  		/* We assume the cache entries are sorted */
> -		if (memcmp(xids, entry->xids, nxids * sizeof(TransactionId)) == 0)
> +		/* XXX we assume the unused bits in "status" are zeroed */

That's a fair assumption if the public entry points assert it.  However, ...

> +		if (memcmp(members, entry->members, nmembers * sizeof(MultiXactMember)) == 0)

... this also assumes the structure has no padding.  To make that safe,
MultiXactStatus should be an int32, not an enum.

> @@ -1338,17 +1367,7 @@ void
>  multixact_twophase_recover(TransactionId xid, uint16 info,
>  						   void *recdata, uint32 len)
>  {
> -	BackendId	dummyBackendId = TwoPhaseGetDummyBackendId(xid);
> -	MultiXactId oldestMember;
> -
> -	/*
> -	 * Get the oldest member XID from the state file record, and set it in the
> -	 * OldestMemberMXactId slot reserved for this prepared transaction.
> -	 */
> -	Assert(len == sizeof(MultiXactId));
> -	oldestMember = *((MultiXactId *) recdata);
> -
> -	OldestMemberMXactId[dummyBackendId] = oldestMember;
> +	/* nothing to do */
>  }
>  
>  /*
> @@ -1359,11 +1378,7 @@ void
>  multixact_twophase_postcommit(TransactionId xid, uint16 info,
>  							  void *recdata, uint32 len)
>  {
> -	BackendId	dummyBackendId = TwoPhaseGetDummyBackendId(xid);
> -
> -	Assert(len == sizeof(MultiXactId));
> -
> -	OldestMemberMXactId[dummyBackendId] = InvalidMultiXactId;
> +	/* nothing to do */
>  }
>  
>  /*
> @@ -1374,7 +1389,7 @@ void
>  multixact_twophase_postabort(TransactionId xid, uint16 info,
>  							 void *recdata, uint32 len)
>  {
> -	multixact_twophase_postcommit(xid, info, recdata, len);
> +	/* nothing to do */
>  }

Looks like you can completely remove TWOPHASE_RM_MULTIXACT_ID.

>  	/*
> -	 * Also truncate MultiXactMember at the previously determined offset.
> +	 * FIXME there's a race condition here: somebody might have created a new
> +	 * segment after we finished scanning the dir.  That scenario would leave
> +	 * us with an invalid truncateXid in shared memory, which is not an easy
> +	 * situation to get out of.  Needs more thought.

Agreed.  Not sure.

Broadly, this feels like a lot of code to handle truncating the segments, but
I don't know how to simplify it.

> @@ -1947,13 +2130,29 @@ MultiXactOffsetPrecedes(MultiXactOffset offset1, MultiXactOffset offset2)
>  	return (diff < 0);
>  }
>  
> +static void
> +WriteMZeroOffsetPageXlogRec(int pageno, TransactionId truncateXid,
> +							uint32 truncateXidEpoch)
> +{
> +	XLogRecData	rdata;
> +	MxactZeroOffPg zerooff;
> +
> +	zerooff.pageno = pageno;
> +	zerooff.truncateXid = truncateXid;
> +	zerooff.truncateXidEpoch = truncateXidEpoch;
> +
> +	rdata.data = (char *) (&zerooff);
> +	rdata.len = sizeof(MxactZeroOffPg);

A MinSizeOf* macro is more conventional.

> +	rdata.buffer = InvalidBuffer;
> +	rdata.next = NULL;
> +	(void) XLogInsert(RM_MULTIXACT_ID, XLOG_MULTIXACT_ZERO_OFF_PAGE, &rdata);
> +}

> --- a/src/backend/utils/time/tqual.c
> +++ b/src/backend/utils/time/tqual.c
> @@ -966,10 +1088,25 @@ HeapTupleSatisfiesMVCC(HeapTupleHeader tuple, Snapshot snapshot,
>  			if (tuple->t_infomask & HEAP_XMAX_INVALID)	/* xid invalid */
>  				return true;
>  
> -			if (tuple->t_infomask & HEAP_IS_LOCKED)		/* not deleter */
> +			if (HeapTupleHeaderIsLocked(tuple))		 /* not deleter */
>  				return true;
>  
> -			Assert(!(tuple->t_infomask & HEAP_XMAX_IS_MULTI));
> +			if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
> +			{
> +				TransactionId	xmax;
> +
> +				xmax = HeapTupleGetUpdateXid(tuple);
> +				if (!TransactionIdIsValid(xmax))
> +					return true;

When does this happen?  Offhand, I'd expect the HeapTupleHeaderIsLocked() test
to keep us from reaching this scenario.  Anyway, the next test would catch it.

> +
> +				/* updating subtransaction must have aborted */
> +				if (!TransactionIdIsCurrentTransactionId(xmax))
> +					return true;
> +				else if (HeapTupleHeaderGetCmax(tuple) >= snapshot->curcid)
> +					return true;	/* updated after scan started */
> +				else
> +					return false;	/* updated before scan started */
> +			}
>  
>  			if (!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmax(tuple)))
>  			{
> @@ -1008,13 +1145,34 @@ HeapTupleSatisfiesMVCC(HeapTupleHeader tuple, Snapshot snapshot,
>  	if (tuple->t_infomask & HEAP_XMAX_INVALID)	/* xid invalid or aborted */
>  		return true;
>  
> -	if (tuple->t_infomask & HEAP_IS_LOCKED)
> +	if (HeapTupleHeaderIsLocked(tuple))
>  		return true;
>  
>  	if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
>  	{
> -		/* MultiXacts are currently only allowed to lock tuples */
> -		Assert(tuple->t_infomask & HEAP_IS_LOCKED);
> +		TransactionId	xmax;
> +
> +		if (HeapTupleHeaderIsLocked(tuple))
> +			return true;

This test is redundant with the one just prior.

> +
> +		xmax = HeapTupleGetUpdateXid(tuple);
> +		if (TransactionIdIsCurrentTransactionId(xmax))
> +		{
> +			if (HeapTupleHeaderGetCmax(tuple) >= snapshot->curcid)
> +				return true;	/* deleted after scan started */
> +			else
> +				return false;	/* deleted before scan started */
> +		}
> +		if (TransactionIdIsInProgress(xmax))
> +			return true;
> +		if (TransactionIdDidCommit(xmax))
> +		{
> +			SetHintBits(tuple, buffer, HEAP_XMAX_COMMITTED, xmax);
> +			/* updating transaction committed, but when? */
> +			if (XidInMVCCSnapshot(xmax, snapshot))
> +				return true;	/* treat as still in progress */
> +			return false;
> +		}

In both HEAP_XMAX_MULTI conditional blocks, you do not set HEAP_XMAX_INVALID
for an aborted updater.  What is the new meaning of HEAP_XMAX_INVALID for
multixacts?  What implications would arise if we instead had it mean that the
updating xid is aborted?  That would allow us to get the mid-term performance
benefit of the hint bit when the updating xid spills into a multixact, and it
would reduce code duplication in this function.

I did not review the other tqual.c changes.  Could you summarize how the
changes to the other functions must differ from the changes to
HeapTupleSatisfiesMVCC()?

> --- a/src/bin/pg_resetxlog/pg_resetxlog.c
> +++ b/src/bin/pg_resetxlog/pg_resetxlog.c
> @@ -332,6 +350,11 @@ main(int argc, char *argv[])
>  	if (set_mxoff != -1)
>  		ControlFile.checkPointCopy.nextMultiOffset = set_mxoff;
>  
> +	/*
> +	if (set_mxfreeze != -1)
> +		ControlFile.checkPointCopy.mxactFreezeXid = set_mxfreeze;
> +		*/
> +
>  	if (minXlogTli > ControlFile.checkPointCopy.ThisTimeLineID)
>  		ControlFile.checkPointCopy.ThisTimeLineID = minXlogTli;
>  
> @@ -578,6 +601,10 @@ PrintControlValues(bool guessed)
>  		   ControlFile.checkPointCopy.nextMulti);
>  	printf(_("Latest checkpoint's NextMultiOffset:  %u\n"),
>  		   ControlFile.checkPointCopy.nextMultiOffset);
> +	/*
> +	printf(_("Latest checkpoint's MultiXact freezeXid: %u\n"),
> +		   ControlFile.checkPointCopy.mxactFreezeXid);
> +		   */

Should these changes be live?  They look reasonable at first glance.

> --- a/src/include/access/htup.h
> +++ b/src/include/access/htup.h
> @@ -164,12 +164,15 @@ typedef HeapTupleHeaderData *HeapTupleHeader;
>  #define HEAP_HASVARWIDTH		0x0002	/* has variable-width attribute(s) */
>  #define HEAP_HASEXTERNAL		0x0004	/* has external stored attribute(s) */
>  #define HEAP_HASOID				0x0008	/* has an object-id field */
> -/* bit 0x0010 is available */
> +#define HEAP_XMAX_KEYSHR_LOCK	0x0010	/* xmax is a key-shared locker */
>  #define HEAP_COMBOCID			0x0020	/* t_cid is a combo cid */
>  #define HEAP_XMAX_EXCL_LOCK		0x0040	/* xmax is exclusive locker */
> -#define HEAP_XMAX_SHARED_LOCK	0x0080	/* xmax is shared locker */
> -/* if either LOCK bit is set, xmax hasn't deleted the tuple, only locked it */
> -#define HEAP_IS_LOCKED	(HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_SHARED_LOCK)
> +#define HEAP_XMAX_IS_NOT_UPDATE	0x0080	/* xmax, if valid, is only a locker.
> +										 * Note this is not set unless
> +										 * XMAX_IS_MULTI is also set. */
> +
> +#define HEAP_LOCK_BITS	(HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_IS_NOT_UPDATE | \
> +						 HEAP_XMAX_KEYSHR_LOCK)
>  #define HEAP_XMIN_COMMITTED		0x0100	/* t_xmin committed */
>  #define HEAP_XMIN_INVALID		0x0200	/* t_xmin invalid/aborted */
>  #define HEAP_XMAX_COMMITTED		0x0400	/* t_xmax committed */
> @@ -187,14 +190,30 @@ typedef HeapTupleHeaderData *HeapTupleHeader;
>  #define HEAP_XACT_MASK			0xFFE0	/* visibility-related bits */

HEAP_XACT_MASK should gain HEAP_XMAX_KEYSHR_LOCK, becoming 0xFFF0.

>  
>  /*
> + * A tuple is only locked (i.e. not updated by its Xmax) if it the Xmax is not
> + * a multixact and it has either the EXCL_LOCK or KEYSHR_LOCK bits set, or if
> + * the xmax is a multi that doesn't contain an update.
> + *
> + * Beware of multiple evaluation of arguments.
> + */
> +#define HeapTupleHeaderInfomaskIsLocked(infomask) \
> +	((!((infomask) & HEAP_XMAX_IS_MULTI) && \
> +	  (infomask) & (HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_KEYSHR_LOCK)) || \
> +	 (((infomask) & HEAP_XMAX_IS_MULTI) && ((infomask) & HEAP_XMAX_IS_NOT_UPDATE)))
> +
> +#define HeapTupleHeaderIsLocked(tup) \
> +	HeapTupleHeaderInfomaskIsLocked((tup)->t_infomask)

I'm uneasy having a HeapTupleHeaderIsLocked() that returns false when a tuple
is both updated and KEY SHARE-locked.  Perhaps HeapTupleHeaderIsUpdated() with
the opposite meaning, or HeapTupleHeaderIsOnlyLocked()?

> +
> +/*
>   * information stored in t_infomask2:
>   */
>  #define HEAP_NATTS_MASK			0x07FF	/* 11 bits for number of attributes */
> -/* bits 0x3800 are available */
> +/* bits 0x1800 are available */
> +#define HEAP_UPDATE_KEY_INTACT	0x2000	/* tuple updated, key cols untouched */
>  #define HEAP_HOT_UPDATED		0x4000	/* tuple was HOT-updated */
>  #define HEAP_ONLY_TUPLE			0x8000	/* this is heap-only tuple */
>  
> -#define HEAP2_XACT_MASK			0xC000	/* visibility-related bits */
> +#define HEAP2_XACT_MASK			0xE000	/* visibility-related bits */
>  
>  /*
>   * HEAP_TUPLE_HAS_MATCH is a temporary flag used during hash joins.  It is
> @@ -221,6 +240,23 @@ typedef HeapTupleHeaderData *HeapTupleHeader;
>  	(tup)->t_choice.t_heap.t_xmin = (xid) \
>  )
>  
> +/*
> + * HeapTupleHeaderGetXmax gets you the raw Xmax field.  To find out the Xid
> + * that updated a tuple, you might need to resolve the MultiXactId if certain
> + * bits are set.  HeapTupleHeaderGetUpdateXid checks those bits and takes care
> + * to resolve the MultiXactId if necessary.  This might involve multixact I/O,
> + * so it should only be used if absolutely necessary.
> + */
> +#define HeapTupleHeaderGetUpdateXid(tup) \
> +( \
> +	(!((tup)->t_infomask & HEAP_XMAX_INVALID) && \
> +	 ((tup)->t_infomask & HEAP_XMAX_IS_MULTI) && \
> +	 !((tup)->t_infomask & HEAP_XMAX_IS_NOT_UPDATE)) ? \
> +		HeapTupleGetUpdateXid(tup) \
> +	: \
> +		HeapTupleHeaderGetXmax(tup) \
> +)
> +
>  #define HeapTupleHeaderGetXmax(tup) \

How about making HeapTupleHeaderGetXmax() do an AssertMacro() against
HEAP_XMAX_IS_MULTI and adding HeapTupleHeaderGetRawXmax() for places that
truly do not care?

>  ( \
>  	(tup)->t_choice.t_heap.t_xmax \
> @@ -721,16 +757,22 @@ typedef struct xl_heap_newpage
>  
>  #define SizeOfHeapNewpage	(offsetof(xl_heap_newpage, blkno) + sizeof(BlockNumber))
>  
> +/* flags for xl_heap_lock.infobits_set */
> +#define XLHL_XMAX_IS_MULTI		0x01
> +#define XLHL_XMAX_IS_NOT_UPDATE	0x02
> +#define XLHL_XMAX_EXCL_LOCK		0x04
> +#define XLHL_XMAX_KEYSHR_LOCK	0x08
> +#define XLHL_UPDATE_KEY_INTACT	0x10
> +
>  /* This is what we need to know about lock */
>  typedef struct xl_heap_lock
>  {
>  	xl_heaptid	target;			/* locked tuple id */
>  	TransactionId locking_xid;	/* might be a MultiXactId not xid */
> -	bool		xid_is_mxact;	/* is it? */
> -	bool		shared_lock;	/* shared or exclusive row lock? */
> +	int8		infobits_set;	/* infomask and infomask2 bits to set */
>  } xl_heap_lock;
>  
> -#define SizeOfHeapLock	(offsetof(xl_heap_lock, shared_lock) + sizeof(bool))
> +#define SizeOfHeapLock	(offsetof(xl_heap_lock, infobits_set) + sizeof(int8))
>  
>  /* This is what we need to know about in-place update */
>  typedef struct xl_heap_inplace
> @@ -768,8 +810,7 @@ extern void HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
>  extern CommandId HeapTupleHeaderGetCmin(HeapTupleHeader tup);
>  extern CommandId HeapTupleHeaderGetCmax(HeapTupleHeader tup);
>  extern void HeapTupleHeaderAdjustCmax(HeapTupleHeader tup,
> -						  CommandId *cmax,
> -						  bool *iscombo);
> +						  CommandId *cmax, bool *iscombo);

Spurious change?

>  
>  /* ----------------
>   *		fastgetattr
> @@ -854,6 +895,9 @@ extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
>  			heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
>  	)
>  
> +/* Prototype for HeapTupleHeader accessor in heapam.c */
> +extern TransactionId HeapTupleGetUpdateXid(HeapTupleHeader tuple);
> +
>  /* prototypes for functions in common/heaptuple.c */
>  extern Size heap_compute_data_size(TupleDesc tupleDesc,
>  					   Datum *values, bool *isnull);
> diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
> index c3ec763..ff255d7 100644
> --- a/src/include/access/multixact.h
> +++ b/src/include/access/multixact.h
> @@ -13,8 +13,14 @@
>  
>  #include "access/xlog.h"
>  
> +
> +/*
> + * The first two MultiXactId values are reserved to store the truncation Xid
> + * and epoch of the first segment, so we start assigning multixact values from
> + * 2.
> + */
>  #define InvalidMultiXactId	((MultiXactId) 0)
> -#define FirstMultiXactId	((MultiXactId) 1)
> +#define FirstMultiXactId	((MultiXactId) 2)
>  
>  #define MultiXactIdIsValid(multi) ((multi) != InvalidMultiXactId)

Seems like this should reject 1, as well.

> --- a/src/include/access/xlog_internal.h
> +++ b/src/include/access/xlog_internal.h
> @@ -71,7 +71,7 @@ typedef struct XLogContRecord
>  /*
>   * Each page of XLOG file has a header like this:
>   */
> -#define XLOG_PAGE_MAGIC 0xD068	/* can be used as WAL version indicator */
> +#define XLOG_PAGE_MAGIC 0xD069	/* can be used as WAL version indicator */

Need to bump pg_control_version, too.

> --- a/src/test/isolation/expected/fk-contention.out
> +++ b/src/test/isolation/expected/fk-contention.out
> @@ -7,9 +7,8 @@ step upd: UPDATE foo SET b = 'Hello World';
>  
>  starting permutation: ins upd com
>  step ins: INSERT INTO bar VALUES (42);
> -step upd: UPDATE foo SET b = 'Hello World'; <waiting ...>
> +step upd: UPDATE foo SET b = 'Hello World';
>  step com: COMMIT;
> -step upd: <... completed>

Excellent!

Thanks,
nm
