Tricky bugs in concurrent index build

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Tricky bugs in concurrent index build
Date: 2006-08-22 17:26:12
Message-ID: 9364.1156267572@sss.pgh.pa.us
Lists: pgsql-hackers

I see fairly nasty problems in the concurrent-index patch when it's
trying to build a unique index. I think they're solvable, but I want
some more eyeballs on my reasoning.

Look at the code in IndexBuildHeapScan where we are deciding whether or
not to include a tuple in the index ("indexIt") and also whether or not
to include it in the uniqueness check ("tupleIsAlive"). In a normal
non-concurrent build, we have to include recently-dead tuples in the
index because transactions started before ours might try to use the
index after we finish it. (Remember system catalogs generally operate
on SnapshotNow, so a query could use a newly created index even though
it will be run with a serializable snapshot much older than the index.)
So we have to put tuples into the index if any active transaction might
wish to see those tuples. OTOH, we should exclude dead tuples from the
uniqueness check: the uniqueness constraint ought to be across
currently-valid tuples only. In particular, for tuples previously
created or deleted by our own transaction, we certainly must include
created ones and not include deleted ones in the uniqueness check.

In the past, the only way we could see HEAPTUPLE_INSERT_IN_PROGRESS
or HEAPTUPLE_DELETE_IN_PROGRESS was for tuples created/deleted by our
own transaction, and so the action taken by IndexBuildHeapScan is
to include the tuple in the index in both cases, but to exclude
DELETE_IN_PROGRESS tuples from the uniqueness check.
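
For reference, the decision table currently looks about like this
(a condensed paraphrase of the IndexBuildHeapScan logic, not the
exact source):

    switch (HeapTupleSatisfiesVacuum(tuple->t_data, OldestXmin, buffer))
    {
        case HEAPTUPLE_LIVE:
            indexIt = true;
            tupleIsAlive = true;
            break;
        case HEAPTUPLE_RECENTLY_DEAD:
            /* old snapshots may still need to see it */
            indexIt = true;
            tupleIsAlive = false;
            break;
        case HEAPTUPLE_INSERT_IN_PROGRESS:
            /* historically, this could only be our own transaction */
            indexIt = true;
            tupleIsAlive = true;
            break;
        case HEAPTUPLE_DELETE_IN_PROGRESS:
            /* ditto; index it, but keep it out of the uniqueness check */
            indexIt = true;
            tupleIsAlive = false;
            break;
        case HEAPTUPLE_DEAD:
            indexIt = false;
            tupleIsAlive = false;
            break;
    }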

This does not work for a concurrent build, though, because if the
in-progress delete is from another transaction, it could roll back after
we look. In that case we have an entry that is in the index and has
escaped the uniqueness check. If it conflicts with another tuple also
entered into the index in the first pass, we'll never notice that.
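
To make that concrete, here's a hypothetical schedule (table t and
unique column k are made up for illustration; T1 is the index build,
T2 some other backend):

    T2: DELETE FROM t WHERE k = 1;   -- tuple A becomes DELETE_IN_PROGRESS
    T1: first pass scans tuple A     -- indexed, excluded from the check
    T1: first pass scans tuple B     -- also k = 1; indexed, passes the
                                     -- check since A wasn't counted
    T2: ROLLBACK;                    -- tuple A is live again
    -- end state: two live tuples with k = 1, both in the unique index,
    -- and no error ever raised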

I think we can solve this by having IndexBuildHeapScan not index
DELETE_IN_PROGRESS tuples if it's doing a concurrent build. The problem
of old transactions trying to use the index does not exist, because
we'll wait 'em out before marking the index valid, so we need not
worry about preserving validity for old snapshots. And if the deletion
does in fact roll back, we'll insert the tuple during the second pass,
and catch any uniqueness violation at that point.
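
That is, the first-pass switch arm would become something like this
(sketch only; ii_Concurrent stands in for however the patch actually
spells the flag):

    case HEAPTUPLE_DELETE_IN_PROGRESS:
        if (indexInfo->ii_Concurrent)
        {
            /*
             * The deleter might roll back after we look.  Leave the
             * tuple out of the index; if the deletion aborts, the
             * second pass will insert it and uniqueness-check it
             * normally.
             */
            indexIt = false;
        }
        else
            indexIt = true;
        tupleIsAlive = false;
        break;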

But wait, there's more: in the patch as it stands, the second pass
over the table ignores DELETE_IN_PROGRESS tuples, which is wrong.
It's entirely possible for a tuple that is RECENTLY_DEAD or
DELETE_IN_PROGRESS to have no entry in the index, if it was inserted
during the first pass, and then the deletion occurred after the first
pass (and either has or hasn't committed yet). If we ignore
DELETE_IN_PROGRESS and then the deleter rolls back, the tuple never
gets into the index at all. Furthermore, the patch also tries to
insert RECENTLY_DEAD tuples, which is good for MVCC coverage, but wrong
for uniqueness checking --- keep in mind that in the second pass,
we are just doing normal index insertions, and so anything we insert
into the index will be uniqueness-checked as though still alive.
We could get a uniqueness failure that should not occur, eg from
trying to insert the old version of a recently-updated row.
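
Putting both problems together, the second-pass selection rules I'm
proposing come out roughly so (just a summary; DELETE_IN_PROGRESS is
deferred to the merge step described next):

    /*
     * Second-pass candidate selection (proposed):
     *
     *   HEAPTUPLE_LIVE               insert if not already in the index
     *   HEAPTUPLE_INSERT_IN_PROGRESS insert if not already in the index
     *   HEAPTUPLE_RECENTLY_DEAD      skip -- inserting risks a false
     *                                uniqueness violation
     *   HEAPTUPLE_DELETE_IN_PROGRESS candidate, but wait out the
     *                                deleter at merge time (see below)
     *   HEAPTUPLE_DEAD               skip
     */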

What I think we can do about this is to include DELETE_IN_PROGRESS
tuples in the set of candidate tuples to insert in the second pass.
During the merge step that verifies whether the tuple is already
in the index, if we find that it's not, then we must wait for the
deleter to commit or roll back. If the deleter commits then we
ignore the tuple. If the deleter rolls back then we have to insert
the tuple in the index. (I think we have to actually take a FOR
UPDATE or possibly FOR SHARE lock on the tuple while we do this,
else we have race conditions against someone else starting a new
deletion attempt on the tuple.) In the commit case we've essentially
waited long enough to transform DELETE_IN_PROGRESS into RECENTLY_DEAD,
and for both of those statuses we are not going to insert into the
index for fear of causing a false uniqueness violation. What that
means is that after we finish the second pass, we need *another*
wait-for-everyone-to-die before we can mark the index valid. Otherwise
we have the same MVCC problem that someone might try to use the index
for a query with a snapshot old enough that it should be able to see
the not-inserted tuple.
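
In pseudo-C, the merge-step handling of a DELETE_IN_PROGRESS tuple
that's missing from the index would run about like this (glossing
over the FOR UPDATE/FOR SHARE tuple lock; xwait is the deleting XID):

    if (!already_in_index)
    {
        TransactionId xwait = HeapTupleHeaderGetXmax(tuple->t_data);

        /* block until the deleting transaction commits or aborts */
        if (TransactionIdIsInProgress(xwait))
            XactLockTableWait(xwait);

        if (TransactionIdDidCommit(xwait))
        {
            /* effectively RECENTLY_DEAD now: do not index it */
        }
        else
        {
            /*
             * Deletion rolled back.  Insert and uniqueness-check the
             * tuple; the tuple lock must already be held here, else a
             * new deleter could sneak in between the check and the
             * insertion.
             */
            index_insert(...);
        }
    }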

Have I missed anything? This is tricky stuff.

In any case it's clear that the patch still needs major work.
Greg, do you have cycles to spare now?

regards, tom lane
