Re: REINDEX CONCURRENTLY 2.0

From: Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: REINDEX CONCURRENTLY 2.0
Date: 2014-10-28 22:59:04
Message-ID: 54501FB8.7000607@BlueTreble.com
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

On 10/1/14, 2:00 AM, Michael Paquier wrote:
> On Wed, Aug 27, 2014 at 3:53 PM, Michael Paquier <michael(dot)paquier(at)gmail(dot)com <mailto:michael(dot)paquier(at)gmail(dot)com>> wrote:
>
> On Wed, Aug 27, 2014 at 3:41 PM, Andres Freund <andres(at)2ndquadrant(dot)com <mailto:andres(at)2ndquadrant(dot)com>> wrote:
> > Can you add it to the next CF? I'll try to look earlier, but can't
> > promise anything.
> >
> > I very much would like this to get committed in some form or another.
> Added it here to keep track of it:
> https://commitfest.postgresql.org/action/patch_view?id=1563
>
> Attached is a fairly-refreshed patch that should be used as a base for the next commit fest. The following changes should be noticed:
> - Use of AccessExclusiveLock when swapping relfilenodes of an index and its concurrent entry instead of ShareUpdateExclusiveLock for safety. At the limit of my understanding, that's the consensus reached until now.
> - Cleanup of many comments and refresh of the documentation that was somewhat wrongly-formulated or shaped at some places
> - Addition of support for autocommit off in psql for REINDEX [ TABLE | INDEX ] CONCURRENTLY
> - Some more code cleanup..
> I haven't been through the tab completion support for psql but looking at tab-completion.c this seems a bit tricky with the stuff related to CREATE INDEX CONCURRENTLY already present. Nothing huge though.

Patch applies against current HEAD and builds, but I'm getting 37 failed tests (mostly parallel, but also misc and WITH; results attached). Is that expeccted?

+ <para>
+ In a concurrent index build, a new index whose storage will replace the one
+ to be rebuild is actually entered into the system catalogs in one
+ transaction, then two table scans occur in two more transactions. Once this
+ is performed, the old and fresh indexes are swapped by taking a lock
+ <literal>ACCESS EXCLUSIVE</>. Finally two additional transactions
+ are used to mark the concurrent index as not ready and then drop it.
+ </para>

The "mark the concurrent index" bit is rather confusing; it sounds like it's referring to the new index instead of the old. Now that I've read the code I understand what's going on here between the concurrent index *entry* and the filenode swap, but I don't think the docs make this sufficiently clear to users.

How about something like this:

The following steps occur in a concurrent index build, each in a separate transaction. Note that if there are multiple indexes to be rebuilt then each step loops through all the indexes we're rebuilding, using a separate transaction for each one.

1. A new "temporary" index definition is entered into the catalog. This definition is only used to build the new index, and will be removed at the completion of the process.
2. A first pass index build is done.
3. A second pass is performed to add tuples that were added while the first pass build was running.
4. pg_class.relfilenode for the existing index definition and the "temporary" definition are swapped. This means that the existing index definition now uses the index data that we stored during the build, and the "temporary" definition is using the old index data.
5. The "temporary" index definition is marked as dead.
6. The "temporary" index definition and it's data (which is now the data for the old index) are dropped.

+ * index_concurrent_create
+ *
+ * Create an index based on the given one that will be used for concurrent
+ * operations. The index is inserted into catalogs and needs to be built later
+ * on. This is called during concurrent index processing. The heap relation
+ * on which is based the index needs to be closed by the caller.

Last bit presumably should be "on which the index is based".

+ /* Build the list of column names, necessary for index_create */
Instead of all this work wouldn't it be easier to create a version of index_create/ConstructTupleDescriptor that will use the IndexInfo for the old index? ISTM index_concurrent_create() is doing a heck of a lot of work to marshal data into one form just to have it get marshaled yet again. Worst case, if we do have to play this game, there should be a stand-alone function to get the columns/expressions for an existing index; you're duplicating a lot of code from pg_get_indexdef_worker().

index_concurrent_swap(): Perhaps it'd be better to create index_concurrent_swap_setup() and index_concurrent_swap_cleanup() and refactor the duplicated code out... the actual function would then become:

ReindexRelationConcurrently()

+ * Process REINDEX CONCURRENTLY for given relation Oid. The relation can be
+ * either an index or a table. If a table is specified, each step of REINDEX
+ * CONCURRENTLY is done in parallel with all the table's indexes as well as
+ * its dependent toast indexes.
This comment is a bit misleading; we're not actually doing anything in parallel, right? AFAICT index_concurrent_build is going to block while each index is built the first time.

+ * relkind is an index, this index itself will be rebuilt. The locks taken
+ * parent relations and involved indexes are kept until this transaction
+ * is committed to protect against schema changes that might occur until
+ * the session lock is taken on each relation.

This comment is a bit unclear to me... at minimum I think it should be "* on parent relations" instead of "* parent relations", but I think it needs to elaborate on why/when we're also taking session level locks.

I also wordsmithed this comment a bit...
* Here begins the process for concurrently rebuilding the index entries.
* We need first to create an index which is based on the same data
* as the former index except that it will be only registered in catalogs
* and will be built later. It is possible to perform all the operations
* on all the indexes at the same time for a parent relation including
* indexes for its toast relation.

and this one...
* During this phase the concurrent indexes catch up with any new tuples that
* were created during the previous phase.
*
* We once again wait until no transaction can have the table open with
* the index marked as read-only for updates. Each index validation is done
* in a separate transaction to minimize how long we hold an open transaction.

+ * a different valid status to avoid an implosion in the number of indexes
+ * a parent relation could have if this operation step fails multiple times
+ * in a row due to a reason or another.

I'd change that to "explosion in the number of indexes a parent relation could have if this operation fails."

Phase 4, 5 and 6 are rather confusing if you don't understand that each "concurrent index" entry is meant to be thrown away. I think the Phase 4 comment should elaborate on that.

The comment in check_exclusion_constraint() is good; shouldn't the related comment on this line in index_create() mention that check_exclusion_constraint() needs to be changed if we ever support concurrent builds of exclusion indexes?
if (concurrent && is_exclusion && !is_reindex)

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

Attachment Content-Type Size
regression.diffs text/plain 816.3 KB
regression.out text/plain 8.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jim Nasby 2014-10-28 23:07:25 Re: Trailing comma support in SELECT statements
Previous Message Robert Haas 2014-10-28 22:54:52 Re: Directory/File Access Permissions for COPY and Generic File Access Functions