Re: Tricky bugs in concurrent index build

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Tricky bugs in concurrent index build
Date: 2006-08-24 20:14:31
Message-ID: 9605.1156450471@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> I'm trying to work out whether we can fix this by taking an MVCC
> snapshot before we scan the index, and then inserting all tuples we find
> in the heap that are live according to the snap but are not present in
> the indexscan data. There are still race conditions in this but I think
> sufficient delay would fix them. Considerations:

> * If a tuple is good according to the snap, it must have been inserted
> by an xact that committed before the snap, therefore there is no
> still-in-progress index insertion happening for it. So if it's not in
> the sorted indexscan data we know we must insert it. If it is in the
> indexscan data, of course we don't have to.

> * If a tuple is not good according to the snap, there are three
> possibilities:

> ** It was never committed good at all. We need not index it.

> ** It was inserted by a transaction that committed post-snap. We assume
> that that transaction has or will index it.

> ** It was deleted by a transaction that committed pre-snap. We assume
> we need not index it because no remaining transaction will care about it.
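The three cases above reduce to a simple predicate on whether the inserting and deleting transactions committed pre-snap. A toy model (plain booleans, not the backend's actual tuple headers or visibility routines):

```python
def must_index(xmin_committed_pre_snap, xmax_committed_pre_snap):
    """Decide whether the validation pass must insert a heap tuple.
    Each flag says whether the inserting (xmin) / deleting (xmax)
    transaction committed before the reference snap; False also covers
    'no such transaction ever committed'."""
    if not xmin_committed_pre_snap:
        # Never committed good, or committed post-snap: the inserting
        # transaction indexes it itself (or it need not be indexed).
        return False
    if xmax_committed_pre_snap:
        # Deleted pre-snap: no remaining transaction will care about it.
        return False
    # Live according to the snap: insert it if the indexscan missed it.
    return True
```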

> The trick is that we must wait long enough to ensure that those two
> assumptions are good. We already know about waiting long enough to
> ensure that all live transactions are aware of the index, which takes
> care of the first assumption as long as we take the snap after that
> wait. However, the second assumption means that we must be sure there
> are no other backends with serializable snapshots older than the snap
> we are using for this. I think this means that we wait for all xacts
> to be aware of our index, take the reference snap, then wait for all
> xacts except ours to die --- without exception --- and then we can
> get on with the second phase of the work.

After a couple hours' thought I have not found any flaws in the above
analysis; furthermore I believe that the second wait need not
happen until just before we are ready to mark the index valid and
commit. So that means that after the indexscan and second heapscan,
we wait for any xacts alive before that to end. That's not too bad.
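To make the ordering of the two waits concrete, here is an outline of the whole validation phase with the dependencies passed in as callables. The names are invented for illustration; none of this is the actual backend API:

```python
def concurrent_phase2(wait_until_all_xacts_see_index, take_snapshot,
                      wait_for_xacts_older_than, index_scan_sorted,
                      heap_scan, is_live_in, index_insert, mark_valid):
    # 1. Wait until every live transaction knows the index exists, so
    #    no still-in-progress index insertion can be missed.
    wait_until_all_xacts_see_index()
    # 2. Take the reference snap only after that wait.
    snap = take_snapshot()
    # 3. Merge: insert heap tuples that are live per the snap but
    #    absent from the sorted indexscan data.
    indexed = set(index_scan_sorted())
    for tid, tup in heap_scan():
        if is_live_in(tup, snap) and tid not in indexed:
            index_insert(tid, tup)
    # 4. The second wait can be postponed to here, just before marking
    #    the index valid: wait for all xacts older than snap to end.
    wait_for_xacts_older_than(snap)
    mark_valid()
```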

> Does this analysis change our conclusions about whether we can
> support unique index builds?

I also believe that we can support unique index builds after all.
There are three further changes needed besides the above:

* The first pass has to insert tuples that are good according to a
snapshot, rather than using HeapTupleSatisfiesVacuum. This prevents
the problem of trying to unique-check multiple versions of the same
row that all chanced to be committed live at the instants we passed
over them. If ambuild gets a failure it means the dataset was not
unique at the instant of the snap, so the user can hardly complain
about getting the error. This will somewhat reduce the coverage of
the initial index build, but that seems like a small price. (Perhaps
the docs should recommend using a smaller-than-normal fillfactor during
concurrent index creation?)
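As an illustration of why snapshot qualification avoids the multiple-versions problem (simplified visibility rule, invented names): in an update chain, each version's deleting xid is the next version's inserting xid, so at most one version of the row can be both inserted-pre-snap and not-deleted-pre-snap.

```python
def snapshot_visible_versions(chain, committed_pre_snap):
    """chain: list of (xmin, xmax) pairs for successive versions of one
    row, xmax None for the newest; committed_pre_snap: predicate on an
    xid.  Returns indexes of versions visible to the snapshot."""
    pre = lambda xid: xid is not None and committed_pre_snap(xid)
    return [i for i, (xmin, xmax) in enumerate(chain)
            if pre(xmin) and not pre(xmax)]
```

Under HeapTupleSatisfiesVacuum-style qualification, versions that were live at the instant the scan passed over them could each be handed to the unique check; the snapshot yields at most one.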

* We have to add the already-discussed logic to bt_check_unique to
recheck liveness of the to-be-inserted tuple immediately before it
raises a unique-index-violation error. This is a waste of cycles
in the normal case, but since it's in an error path that doesn't
seem like a big problem. (Is it worth extending the aminsert API
so that aminsert knows whether it's being called by CREATE INDEX
CONCURRENTLY or not? I'm inclined not to bother, at least not
right now with the bitmap AM patch still unmerged.)

* When about to insert a tuple in pass 2, we can make the code
check liveness of the tuple and suppress the unique check if it's
already committed dead. This means the bt_check_unique addition
only matters if the tuple goes dead in the interval between starting the
index insert and finishing it. However that's an important case
to handle the concurrent-UPDATE scenario: if we find a conflicting
entry in the index, we'll wait for its inserter to commit, and that
is the same transaction that will have committed our new entry dead.
So the recheck is necessary and sufficient to cover this case.
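A sketch of that error-path recheck (hypothetical names and exception type; the real logic lives in the btree insertion code and works on tuple TIDs and buffer locks, none of which is modeled here):

```python
class UniqueViolation(Exception):
    pass

def bt_check_unique_sketch(new_tid, conflicting_tids, tid_is_live):
    """Raise UniqueViolation only if a conflicting entry is live AND our
    own to-be-inserted tuple is still live at that moment."""
    for tid in conflicting_tids:
        if tid_is_live(tid):
            # About to raise: recheck our own tuple's liveness first.
            # In the concurrent-UPDATE case, the transaction that
            # inserted the conflicting entry (and that we waited on)
            # is the same one that committed our entry dead, so this
            # recheck is necessary and sufficient to cover it.  The
            # extra check costs cycles only on the error path.
            if not tid_is_live(new_tid):
                return  # our entry went dead mid-insert: no error
            raise UniqueViolation("duplicate key at tid %r" % (tid,))
```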

Barring objections, I'm off to program this.

regards, tom lane
