Re: Tricky bugs in concurrent index build

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Tricky bugs in concurrent index build
Date: 2006-08-24 17:14:09
Message-ID: 6915.1156439649@sss.pgh.pa.us
Lists: pgsql-hackers

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>> Unless someone's got a brilliant idea, my recommendation at this point
>> is that we restrict the patch to building only non-unique indexes.

> I assume you'll add the check?

Yeah, I'll take care of it.

At the moment it may be moot, because I've realized that validate_index
doesn't work anyway. It is scanning the index and then assuming that
any tuple inserted into the index subsequent to that scan will still be
INSERT_IN_PROGRESS when the heapscan reaches it. That's completely
bogus of course --- the tuple could be committed live or even already
deleted by the time the heapscan reaches it. This would result in
making duplicate index entries, which is unacceptable even in a
nonunique index (it'd cause indexscans to return the same row twice).
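[Editor's illustration] The race can be sketched with a toy model (all names here are invented; this is not PostgreSQL's actual validate_index code). The buggy logic skips only tuples that still look INSERT_IN_PROGRESS, so a tuple whose inserting transaction indexed it and committed between the indexscan and the heapscan gets inserted a second time:

```python
# Toy sketch of the buggy assumption (invented names, not real code):
# validate_index scans the index, then scans the heap, assuming any
# tuple indexed after the indexscan still shows INSERT_IN_PROGRESS.

def naive_validate(index_tids, heap):
    """Insert every heap tuple that is neither INSERT_IN_PROGRESS nor
    already present in the indexscan data -- the flawed filter."""
    inserted = []
    for tid, state in heap:
        if state == "INSERT_IN_PROGRESS":
            continue  # assume the inserting xact will index it itself
        if tid not in index_tids:
            inserted.append(tid)  # treat as missing; add an entry
    return inserted

# Tuple 7 was inserted AND indexed by its own xact after the indexscan
# finished, then that xact committed before the heapscan reached it,
# so the tuple now looks plain LIVE:
index_tids = {1, 2, 3}  # what the indexscan saw
heap = [(1, "LIVE"), (2, "LIVE"), (3, "LIVE"), (7, "LIVE")]
print(naive_validate(index_tids, heap))  # [7] -- a duplicate index entry
```

Tuple 7 already has an index entry made by its own transaction; the naive pass adds a second one, which is exactly the duplicate-entry hazard described above.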

I'm trying to work out whether we can fix this by taking an MVCC
snapshot before we scan the index, and then inserting all tuples we find
in the heap that are live according to the snap but are not present in
the indexscan data. There are still race conditions in this but I think
sufficient delay would fix them. Considerations:

* If a tuple is good according to the snap, it must have been inserted
by an xact that committed before the snap, therefore there is no
still-in-progress index insertion happening for it. So if it's not in
the sorted indexscan data we know we must insert it. If it is in the
indexscan data, of course we don't have to.

* If a tuple is not good according to the snap, there are three
possibilities:

** It was never committed good at all. We need not index it.

** It was inserted by a transaction that committed post-snap. We assume
that that transaction has or will index it.

** It was deleted by a transaction that committed pre-snap. We assume
we need not index it because no remaining transaction will care about it.
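[Editor's illustration] The case analysis above can be modeled with a toy MVCC sketch (the dict layout and every name below are my invention, not PostgreSQL's actual visibility machinery): a tuple is good per the reference snapshot iff its inserting xact committed pre-snap and no pre-snap-committed xact deleted it, and only good tuples absent from the sorted indexscan data need entries:

```python
# Toy MVCC model of the proposed phase-2 merge (invented names; the
# real visibility rules live in PostgreSQL's C code, not here).

def good_per_snap(tup, snap_committed):
    """Good per the snapshot iff the inserter committed pre-snap
    (otherwise cases 1 or 2: never committed, or committed post-snap)
    and no pre-snap committed xact deleted it (otherwise case 3)."""
    inserted = tup["xmin"] in snap_committed
    deleted = tup.get("xmax") in snap_committed  # None if never deleted
    return inserted and not deleted

def phase2_missing(heap, index_tids, snap_committed):
    """TIDs we must insert: good per the snap, yet absent from the
    sorted indexscan data."""
    return [t["tid"] for t in heap
            if good_per_snap(t, snap_committed)
            and t["tid"] not in index_tids]

snap_committed = {10, 11, 12}            # xacts committed pre-snap
heap = [
    {"tid": 1, "xmin": 10},              # good, already indexed: skip
    {"tid": 2, "xmin": 11},              # good, missing: must insert
    {"tid": 3, "xmin": 99},              # inserter committed post-snap
    {"tid": 4, "xmin": 10, "xmax": 12},  # deleted pre-snap: skip
    {"tid": 5, "xmin": 50},              # never committed: skip
]
print(phase2_missing(heap, {1}, snap_committed))  # [2]
```

Only tuple 2 gets a new entry; tuples 3, 4, and 5 fall into the three not-good cases and are left to the assumptions stated above.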

The trick is that we must wait long enough to ensure that those two
assumptions are good. We already know about waiting long enough to
ensure that all live transactions are aware of the index, which takes
care of the first assumption as long as we take the snap after that
wait. However, the second assumption means that we must be sure there
are no other backends with serializable snapshots older than the snap
we are using for this. I think this means that we wait for all xacts
to be aware of our index, take the reference snap, then wait for all
xacts except ours to die --- without exception --- and then we can
get on with the second phase of the work.

It might be OK to do the indexscan and sort before we do the second
wait, though, so we could get at least something done.
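[Editor's illustration] The ordering in the last two paragraphs can be pinned down as a sequence of steps. Every function name below is invented purely to record that sequence; the stubs just log the order in which the steps run:

```python
# Hypothetical outline of the two-wait protocol (all names invented).
# Each stub logs its step so the ordering is explicit and checkable.

log = []

def wait_for_all_xacts_to_see_index():
    log.append("wait1")  # every live xact now maintains the new index

def take_reference_snapshot():
    log.append("snap")   # must come after the first wait
    return "snap"

def indexscan_and_sort():
    log.append("scan")   # may be done before the second wait

def wait_for_pre_snap_xacts_to_finish():
    log.append("wait2")  # no serializable snapshot older than ours remains

def merge_missing_tuples(snap):
    log.append("merge")  # phase 2: insert heap tuples good per snap

def validate_index_phase2():
    wait_for_all_xacts_to_see_index()
    snap = take_reference_snapshot()
    indexscan_and_sort()
    wait_for_pre_snap_xacts_to_finish()
    merge_missing_tuples(snap)

validate_index_phase2()
print(log)  # ['wait1', 'snap', 'scan', 'wait2', 'merge']
```

The point of the ordering: the snap is taken only after every xact honors the index (first assumption), and the merge runs only after no xact holds a snapshot older than the snap (second assumption), while the indexscan-and-sort can overlap the second wait.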

Comments? Have I missed any case for a tuple to fail the snap?

Does this analysis change our conclusions about whether we can
support unique index builds?

regards, tom lane
