Re: assertion failure 9.3.4

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Andrew Dunstan <andrew(dot)dunstan(at)pgexperts(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: assertion failure 9.3.4
Date: 2014-04-21 18:54:22
Message-ID: 20140421185422.GA13906@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

I spent the last two hours poking around in the environment Andrew
provided and I was able to reproduce the issue, find an assert to
reproduce it much faster, and find a possible root cause.

Since the symptom of the problem seems to be multixacts with more than
one updating xid, I added a check to MultiXactIdCreateFromMembers()
preventing that. That requires moving ISUPDATE_from_mxstatus() to a
header, but I think we should definitely add such an assert.
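
A minimal sketch of what such a check could look like, for illustration
only. The types below are simplified stand-ins rather than the real
PostgreSQL headers, the enum values and the ISUPDATE_from_mxstatus()
definition are reproduced from memory of the 9.3 sources and should be
treated as assumptions, and members_have_single_updater() is a
hypothetical helper name, not something from the actual patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the PostgreSQL definitions (assumed values). */
typedef uint32_t TransactionId;

typedef enum
{
    MultiXactStatusForKeyShare = 0x00,
    MultiXactStatusForShare = 0x01,
    MultiXactStatusForNoKeyUpdate = 0x02,
    MultiXactStatusForUpdate = 0x03,
    MultiXactStatusNoKeyUpdate = 0x04,  /* update that does not touch keys */
    MultiXactStatusUpdate = 0x05        /* other updates, and delete */
} MultiXactStatus;

typedef struct MultiXactMember
{
    TransactionId   xid;
    MultiXactStatus status;
} MultiXactMember;

/* Statuses above ForUpdate are real updates rather than mere locks. */
#define ISUPDATE_from_mxstatus(status) ((status) > MultiXactStatusForUpdate)

/*
 * Hypothetical helper: returns true if at most one member of the
 * would-be multixact is an updater.
 */
static bool
members_have_single_updater(const MultiXactMember *members, int nmembers)
{
    bool has_update = false;

    for (int i = 0; i < nmembers; i++)
    {
        if (ISUPDATE_from_mxstatus(members[i].status))
        {
            if (has_update)
                return false;   /* second updater found */
            has_update = true;
        }
    }
    return true;
}
```

In a function like MultiXactIdCreateFromMembers() this would presumably
be wrapped in an Assert() on the incoming member array, so that a second
updater trips the assertion at creation time instead of much later.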

As it turns out the problem is in the
    else if (result == HeapTupleBeingUpdated && wait)
branch in (at least) heap_update(). When the problem is hit, the
to-be-updated tuple originally has HEAP_XMIN_COMMITTED |
HEAP_XMAX_LOCK_ONLY | HEAP_XMAX_KEYSHR_LOCK set. So we release the
buffer lock, acquire the tuple lock, and reacquire the buffer lock. But
in between, the locking backend has actually updated the tuple.
The code tries to protect against that with:
    /*
     * recheck the locker; if someone else changed the tuple while
     * we weren't looking, start over.
     */
    if ((oldtup.t_data->t_infomask & HEAP_XMAX_IS_MULTI) ||
        !TransactionIdEquals(
                             HeapTupleHeaderGetRawXmax(oldtup.t_data),
                             xwait))
        goto l2;

    can_continue = true;
    locker_remains = true;

and similar. The problem is that in Andrew's case the infomask changes
from 0x2192 to 0x2102 (i.e. it's a normal update afterwards), while xmax
stays the same. Ooops.
A bit later there's:
    result = can_continue ? HeapTupleMayBeUpdated : HeapTupleUpdated;
So, from there on we happily continue to update the tuple, thinking
there's no previous updater. Which obviously causes problems.
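
To make the infomask transition concrete, here is a small standalone
sketch that decodes the two values and evaluates the existing recheck
against them. The flag values are written from memory of 9.3's
htup_details.h and should be treated as assumptions; the helper names
old_recheck_restarts() and infomask_diff() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Infomask bits as recalled from 9.3's htup_details.h (assumed values). */
#define HEAP_XMAX_KEYSHR_LOCK 0x0010
#define HEAP_XMAX_LOCK_ONLY   0x0080
#define HEAP_XMIN_COMMITTED   0x0100
#define HEAP_XMAX_IS_MULTI    0x1000

typedef uint32_t TransactionId;

/*
 * Mirrors the existing recheck: start over only if xmax has become a
 * multixact or no longer equals the xid we waited for.
 */
static bool
old_recheck_restarts(uint16_t infomask_now, TransactionId xmax_now,
                     TransactionId xwait)
{
    return (infomask_now & HEAP_XMAX_IS_MULTI) != 0 || xmax_now != xwait;
}

/* Bits that differ between two infomask snapshots. */
static uint16_t
infomask_diff(uint16_t before, uint16_t after)
{
    return (uint16_t) (before ^ after);
}
```

With before = 0x2192 and after = 0x2102, infomask_diff() yields
HEAP_XMAX_LOCK_ONLY | HEAP_XMAX_KEYSHR_LOCK: the xmax went from "key
share locker" to "updater". Since HEAP_XMAX_IS_MULTI is not set and xmax
is unchanged, old_recheck_restarts() returns false and the code does not
jump back to l2.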

I've hacked^Wfixed this by changing the infomask test above into
infomask != oldtup.t_data->t_infomask in a couple of places. That seems
to be sufficient to survive the testcase a couple of times.
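
A rough sketch of the changed test, under the assumption that infomask
holds the value read before the buffer lock was released and that the
xmax comparison is kept as is. tuple_changed_while_unlocked() is a
hypothetical name used only for this illustration, not code from the
actual change.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Hypothetical helper: infomask_saved is the value read before the
 * buffer lock was released, infomask_now the value seen after it was
 * reacquired. Any difference in the flags, or a different xmax, means
 * the tuple was changed while we were not looking.
 */
static bool
tuple_changed_while_unlocked(uint16_t infomask_saved, uint16_t infomask_now,
                             TransactionId xmax_now, TransactionId xwait)
{
    return infomask_saved != infomask_now || xmax_now != xwait;
}
```

For the 0x2192 to 0x2102 transition with an unchanged xmax this returns
true, so the caller would go back to l2 instead of setting can_continue.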

I am too hungry right now to think about a proper fix for this and
whether there are further problematic areas.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
