Re: Race conditions, race conditions!

From: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Race conditions, race conditions!
Date: 2005-05-08 02:31:49
Message-ID: 20050508.113149.115902969.t-ishii@sra.co.jp
Lists: pgsql-hackers

Are we going to put the fixes into 8.0.3 and so on? Or will they be
included in 8.0.4?
--
Tatsuo Ishii

> I promised an analysis of the problems Jan Wieck uncovered yesterday,
> so here it is.
>
> Jan had developed a testbed (which I hope he will post) that essentially
> has a bunch of client threads all doing instances of the same
> transaction in READ COMMITTED mode. The intended database invariant is
> that the number of rows in table t2 with a particular ID value is equal
> to the "cnt" field of the single row in t1 with that ID value:
>
> begin;
> select cnt from t1 where id = K for update;
> delete from t2 where id = K;
> -- check returned rowcount to see that exactly CNT rows were deleted
> insert into t2 values (K); -- repeat this N times
> update t1 set cnt = N where id = K;
> commit;
>
> K is randomly chosen for each execution from the known set of t1 keys,
> and N is randomly chosen each time as a small positive integer. This
> should maintain the invariant, since at any time only one transaction
> can be holding the SELECT FOR UPDATE lock on a particular t1 row.
>
> The trick is to run this under astronomical load; 100 or so client
> threads on a garden-variety PC is about right.
>
> What Jan was seeing was that every so often, a client thread would
> error out, reporting that its DELETE had deleted zero rows rather
> than the expected number. But subsequent examination showed the t2
> rows as being there.
>
> Investigation showed that the connected backend had properly acquired
> FOR UPDATE lock on the t1 row, but the snapshot it was using for the
> subsequent DELETE showed the inserter of the t1 row as still running.
> This should be impossible, since the SELECT FOR UPDATE cannot lock an
> uncommitted row, and in READ COMMITTED mode we certainly will take a
> new snapshot for the DELETE.
>
> However, there is a race condition here. During transaction commit,
> xact.c first marks the transaction committed in pg_clog, and then
> clears its XID from PGPROC. This means there is a narrow window
> in which both TransactionIdDidCommit and TransactionIdIsInProgress
> will return true. (We cannot do it the other way around, because
> if neither one is returning true, onlookers are entitled to assume
> that the transaction has crashed.) However, the tqual.c routines
> will allow a row to be seen as committed as soon as
> TransactionIdDidCommit(xmin) returns true. So the scenario is:
>
> 1. Backend A does RecordTransactionCommit to mark itself committed
> in pg_clog, but then loses the CPU in the narrow window between
> doing that and clearing its PGPROC entry. Because of the ridiculous
> load, it doesn't get control back for a while.
>
> 2. Backend B comes along to run the test transaction for the same K
> value. It inspects the t1 row, concludes it's committed, marks the
> row as locked FOR UPDATE, and returns the results to the client.
>
> 3. The client now issues the DELETE command. B takes a new snapshot,
> but because A is still not cleared out of PGPROC, A's transaction
> is shown as still running in the snapshot.
>
> 4. Now the DELETE will delete no rows, because it doesn't consider the
> t2 rows it should delete to be committed.
>
>
> AFAICS this race condition has always been there; certainly at
> least since Vadim put in MVCC, and it looks likely that the
> original Berkeley code had a form of the problem.
>
> The correct fix is that the tqual.c routines should check commit status
> in the following way:
>
>     if (TransactionIdIsInProgress(xid))
>         // still in progress, don't touch tuple
>     else if (TransactionIdDidCommit(xid))
>         // committed, mark accordingly
>     else
>         // must be aborted or crashed, mark accordingly
>
> rather than what they have traditionally done:
>
>     if (TransactionIdDidCommit(xid))
>         // committed, mark accordingly
>     else if (TransactionIdDidAbort(xid))
>         // aborted, mark accordingly
>     else
>         // assume still in progress, don't touch tuple
>
> Vadim recognized that the former logic was necessary for VACUUM to use
> in deciding if it could clean up dead tuples, but apparently he never
> made the extrapolation that it should be used *everywhere* transaction
> status is tested.
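>
> Spelled out a bit more fully, the corrected test amounts to something
> like this (a simplified sketch; the real tqual.c routines also deal with
> hint bits, our own XID, and subtransactions):
>
>     typedef enum { XACT_IN_PROGRESS, XACT_COMMITTED, XACT_ABORTED } XactStatus;
>
>     /*
>      * Ask "is it still in progress?" before consulting pg_clog, so that a
>      * transaction caught in the committed-in-clog-but-still-in-PGPROC
>      * window is reported as running, which is the safe answer.
>      */
>     static XactStatus
>     check_xid_status(TransactionId xid)
>     {
>         if (TransactionIdIsInProgress(xid))
>             return XACT_IN_PROGRESS;
>         else if (TransactionIdDidCommit(xid))
>             return XACT_COMMITTED;
>         else
>             return XACT_ABORTED;   /* aborted, or crashed without recording */
>     }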
>
>
> The other interesting thing we saw was an Assertion failure. The
> postmaster log showed
>
> WARNING: relation "t1" page 196 is uninitialized --- fixing
> TRAP: FailedAssertion("!((((PageHeader) ((PageHeader) pageHeader))->pd_upper == 0))", File: "hio.c", Line: 263)
> LOG: server process (PID 11296) was terminated by signal 6
>
> The WARNING could only have come from VACUUM. (Jan's testbed does
> launch a VACUUM every so often.) The Assert failure is in
> RelationGetBufferForTuple where it is adding a new page to a table.
>
> I interpret this as follows: the guy doing RelationGetBufferForTuple added a
> page, but before he could initialize it, he lost control for long enough
> for a VACUUM to scan through the entire table, see the zeroed page, and
> "fix" it. Then when the first guy got control again, his Assert saying
> the page was zeroes failed. The window for this exists because bufmgr/smgr
> do physically extend the file, but when the new buffer is returned
> to the caller, it is only pinned, not locked. So it is possible for
> VACUUM to acquire lock on the new page before RelationGetBufferForTuple
> does. (This is exceedingly improbable, because VACUUM would have to
> scan through the entire relation first --- it wouldn't examine the
> new page at all unless it was included in RelationGetNumberOfBlocks
> at the start of VACUUM's scan. We have not been able to reproduce the
> crash, but we did see it once.)
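>
> (For reference, the VACUUM-side "fixing" being described looks roughly
> like this; a paraphrase of the lazy-vacuum page scan, not the exact code,
> and the free-space call is simplified:)
>
>     /* lazy VACUUM, scanning heap page blkno held in buffer buf */
>     Page    page = BufferGetPage(buf);
>
>     if (PageIsNew(page))
>     {
>         /* all-zeroes page: assume a failed extension left it behind */
>         elog(WARNING, "relation \"%s\" page %u is uninitialized --- fixing",
>              relname, blkno);
>         PageInit(page, BufferGetPageSize(buf), 0);
>         /* ...and advertise its free space for reuse */
>         record_free_space(blkno, PageGetFreeSpace(page));  /* simplified FSM call */
>     }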
>
> At first I thought this didn't have any consequences worse than an
> Assert, since after all both VACUUM and the original extender are just
> trying to set up the page as empty. But consider this scenario:
>
> 1. Process A adds a zero page and gets blocked ahead of initializing it.
>
> 2. Process B runs VACUUM ... sees the zero page ... fixes it ... puts it
> in FSM.
>
> 3. Process C takes the page from FSM and puts some new tuples in it.
>
> 4. Process A initializes the page, thus discarding C's tuples.
>
> This could not happen if Asserts are enabled, because A would Assert
> before re-initializing the page (and that's inside a lock so it is
> sure to happen that way). But with Asserts off, it's within the realm
> of possibility under sufficiently heavy load.
>
> This risk has been there since lazy VACUUM was created (it cannot
> happen with VACUUM FULL because that takes an exclusive lock on the
> whole table). I believe the best fix is:
>
> 1. Change RelationGetBufferForTuple to hold the "relation extension"
> lock until after it's exclusive-locked the new buffer.
>
> 2. Adjust VACUUM to not try to "fix" a page without getting the relation
> extension lock. If the page is still uninitialized after obtaining
> that lock, then it must be a really lost page, not one that someone
> is still in process of setting up.
>
> There is a similar risk in btree indexes, with a similar fix.
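>
> For point 1, the intended ordering is roughly as follows (a sketch with
> illustrative lock-helper names, not the actual patch):
>
>     /* Sketch: hold the extension lock until the new buffer is locked */
>     static Buffer
>     extend_relation_sketch(Relation relation)
>     {
>         Buffer  buffer;
>
>         LockRelationForExtension(relation);         /* serialize extenders */
>         buffer = ReadBuffer(relation, P_NEW);       /* physically add a page */
>         LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);  /* lock it before... */
>         UnlockRelationForExtension(relation);       /* ...letting others in */
>
>         /* page is still all zeroes; initialize it while we hold the lock */
>         PageInit(BufferGetPage(buffer), BufferGetPageSize(buffer), 0);
>         return buffer;
>     }
>
> With that ordering, VACUUM cannot get the buffer lock (and hence cannot
> "fix" the page or hand it out via the FSM) until the extender has
> initialized it.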
>
>
> We are currently testing candidate patches for these problems,
> and so far haven't seen any further failures.
>
> regards, tom lane
>
