Re: CONCURRENT INDEXing again (was: Must be owner to truncate?)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Hannu Krosing <hannu(at)skype(dot)net>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Mike Mascari <mascarm(at)mascari(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, andrew(at)supernews(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: CONCURRENT INDEXing again (was: Must be owner to truncate?)
Date: 2005-07-12 16:20:09
Message-ID: 19968.1121185209@sss.pgh.pa.us
Lists: pgsql-hackers

Hannu Krosing <hannu(at)skype(dot)net> writes:
> I'll try to state my reworked idea of concurrent indexing in a clearer
> way:

> The index build is done in 2 transactions, and needs one new type of
> lock and a new system column in pg_class to tell the planner not to use
> an incomplete index. This is similar to vacuum in that it needs its own
> transactions and is not rollbackable.

"Not rollbackable" is certainly "not acceptable"... what if there's a
crash partway through? VACUUM FULL is designed so that there isn't
anything special that needs to be done to clean up after it if it fails
partway through, but this seems to be untrue of your proposal. You'd
have committed-created but useless indexes that have to be got rid of
somehow.
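
For concreteness, here is roughly how I read the proposed flow (a sketch
only --- every name below is invented for illustration, the helpers are
empty stubs so it compiles, and none of it is existing backend code):

#include <stdint.h>

typedef uint64_t ItemPointer;       /* stand-in for a real ctid */

/* Upper-case names mirror the terms used in the proposal. */
static ItemPointer MAX_CTID_1;
static ItemPointer MAX_CTID_2;

static ItemPointer current_table_end(void) { return 0; }
static void start_transaction(void) {}
static void commit_transaction(void) {}
static void acquire_tuple_store_range_lock(ItemPointer fence) { (void) fence; }
static void build_index_below(ItemPointer fence) { (void) fence; }
static void mark_index_incomplete(void) {}
static void wait_for_all_older_transactions(void) {}
static void insert_missing_entries(ItemPointer from, ItemPointer to)
{ (void) from; (void) to; }
static void mark_index_complete(void) {}

static void
concurrent_index_build(void)
{
    /* Transaction 1: index everything below MAX_CTID_1, with the range
     * lock keeping new tuples out of that part of the table. */
    start_transaction();
    MAX_CTID_1 = current_table_end();
    acquire_tuple_store_range_lock(MAX_CTID_1);
    build_index_below(MAX_CTID_1);
    mark_index_incomplete();        /* the proposed new pg_class flag */
    commit_transaction();           /* a crash any time after this commit
                                     * leaves the useless index behind */

    /* Wait out every transaction that started before that commit, so
     * that all backends are maintaining the new index. */
    wait_for_all_older_transactions();

    /* Transaction 2: catch up on tuples stored meanwhile, then let the
     * planner see the index. */
    start_transaction();
    MAX_CTID_2 = current_table_end();
    insert_missing_entries(MAX_CTID_1, MAX_CTID_2);
    mark_index_complete();
    commit_transaction();
}

int
main(void)
{
    concurrent_index_build();
    return 0;
}

My complaint above is about that first commit: once it's down, nothing
rolls the half-built index back for you.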

> To be sure that we cover all the tuples in the range <= MAX_CTID_1, and
> that no new tuples are stored there as the result of an INSERT or
> UPDATE, we need a new type of lock (let's call it "TupleStoreRangeLock"),
> which prevents new tuples from being placed below MAX_CTID_1 and which
> is acquired before starting the initial build.

Checking for such a lock will put a nontrivial distributed cost on every
insert and update, whether the facility is in use or not.
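
To put that concretely, every tuple placement would need something on
the order of the following test (a much-simplified sketch with invented
names --- a ctid is reduced to a block number and the lock to one shared
flag), and the test runs whether or not any index build is in progress:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

struct range_lock
{
    bool        active;             /* is a concurrent build running? */
    BlockNumber max_locked_block;   /* MAX_CTID_1: new tuples go above this */
};

/* Would have to be consulted for every candidate target page of an
 * INSERT or UPDATE, on every table, all the time. */
static bool
page_is_usable(const struct range_lock *lock, BlockNumber target)
{
    if (lock->active && target <= lock->max_locked_block)
        return false;           /* forced to pick a page above the fence */
    return true;
}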

> After the initial build of the index for tuples below MAX_CTID_1 is
> finished, it is made visible to the rest of the system by committing the
> transaction, but marking the index as "incomplete" (probably a new
> column in pg_class is needed for that), so that it will not be used by
> the planner, but all new inserts/updates will see and use it.

I can see how this might work for a new index build, but it doesn't work
for REINDEX --- you get to have other transactions inserting into either
the old or the new index, not both. You could possibly make it work by
creating a complete new index with its own OID, and then swapping that
out for the old index at completion --- but see below.
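
As I understand the "incomplete" flag, it would have to split the
planner's and executor's views of the index roughly like this (again a
sketch; the field name is invented, and no such pg_class column exists
today):

#include <stdbool.h>

struct index_meta
{
    bool incomplete;    /* cleared only after the catch-up pass finishes */
};

/* Planner: never choose an incomplete index for a scan. */
static bool
index_usable_for_planning(const struct index_meta *idx)
{
    return !idx->incomplete;
}

/* Executor: every INSERT/UPDATE still maintains the index, complete or
 * not, so it doesn't fall further behind while being built. */
static bool
index_needs_maintenance(const struct index_meta *idx)
{
    (void) idx;
    return true;
}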

> After that we need to wait for all other running transactions to
> complete, so we can be sure that all other backends know about the new
> index.

That could be a pretty long time ... and presumably the index build is
pretty long, too, or we'd not be bothering with all this mechanism.
All the while, tuples are getting inserted into highly nonoptimal pages
far from the start of the table. Doesn't this idea conflict rather
badly with your desire expressed nearby to force tuples to be inserted
near the front of the table?

> As the final step we need to scan all tuples in the range (MAX_CTID_1 to
> MAX_CTID_2) and insert their corresponding index entries into the new
> index. If an entry already exists for the exact same ctid, that is ok.

I don't think that's as easy as it sounds; at the very least it requires
mechanism comparable to the unique-index checking code, which we don't
have for any index type except btree. Also, in an index that's actually
highly non-unique, that mechanism is *expensive* --- you may have to
scan many pages of identically-keyed entries to see if any of them match
the target ctid ... all the while holding a lock that prevents anyone
else from inserting on the same starting page.

What's more, the pseudo-uniqueness check has to be done on both sides
--- else the index build might decide the entry's not there and insert
it, only to have the original tuple inserter come along right after and
insert again. So this is a second major operational mode that has to
affect everybody in the system, not only the index build. I'm not sure
whether
there are race conditions associated with getting in and out of this
mode, but it wouldn't surprise me.
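
For illustration, the check the final pass (and, per the above, every
ordinary inserter) would need looks something like this --- structure
and names all invented; in the real thing the walk is over btree pages
of identically-keyed entries, with the insertion point locked the whole
time:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t ItemPointer;       /* stand-in for a heap tuple's ctid */

struct index_entry
{
    int         key;
    ItemPointer ctid;
};

/* Before inserting (key, ctid), look for an existing entry carrying the
 * same ctid.  With a highly non-unique key, every identically-keyed
 * entry has to be inspected before we can conclude the tuple isn't
 * indexed yet. */
static bool
entry_already_present(const struct index_entry *entries, size_t nentries,
                      int key, ItemPointer ctid)
{
    for (size_t i = 0; i < nentries; i++)
    {
        if (entries[i].key == key && entries[i].ctid == ctid)
            return true;        /* someone already indexed this tuple */
    }
    return false;
}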

> After reaching MAX_CTID_2, the index is ready for use by the planner as
> well and can be marked as "complete" in pg_class. In case of REINDEX,
> the new index can be made to replace the old one at this point.

AFAICS, the "replace" bit requires exclusive lock to make sure that no
one is in the midst of using the old index. This means that you have a
situation where you need to upgrade your table lock at the very end of
the operation --- which means the whole thing is prone to failing at the
very end because of deadlock.

regards, tom lane
