Re: Index corruption with CREATE INDEX CONCURRENTLY

From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Index corruption with CREATE INDEX CONCURRENTLY
Date: 2017-02-02 15:18:00
Message-ID: CABOikdPRx-5Q2_EfyQEHKokt1ixjRNicT3BkxUapv7SeNs74=Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jan 30, 2017 at 7:20 PM, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
wrote:

>
> Based on my investigation so far and the evidence that I've collected,
> what probably happens is that after a relcache invalidation arrives at the
> new transaction, it recomputes the rd_indexattr but based on the old,
> cached rd_indexlist. So the rd_indexattr does not include the new columns.
> Later rd_indexlist is updated, but the rd_indexattr remains what it was.
>
>
I was kinda puzzled this report did not get any attention, though it's
clearly a data corruption issue. Anyways, now I know why this is happening
and can reproduce this very easily via debugger and two sessions.

The offending code is RelationGetIndexAttrBitmap(). Here is the sequence of
events:

Taking the same table as in the report:

CREATE TABLE cic_tab_accounts (
aid bigint UNIQUE,
abalance bigint,
aid1 bigint
);

1. Session S1 calls DROP INDEX pgb_a_aid1 and completes
2. Session S2 is in process of rebuilding its rd_indexattr bitmap (because
of previous relcache invalidation caused by DROP INDEX). To be premise,
assume that the session has finished rebuilding its index list. Since DROP
INDEX has just happened,
4793 indexoidlist = RelationGetIndexList(relation);

3. S1 then starts CREATE INDEX CONCURRENTLY pgb_a_aid1 ON
cic_tab_accounts(aid1)
4. S1 creates catalog entry for the new index, commits phase 1 transaction
and builds MVCC snapshot for the second phase and also finishes the initial
index build
5. S2 now proceeds. Now notice that S2 had computed the index list based on
the old view i.e. just one index.

The following comments in relcahce.c are now significant:

4799 /*
4800 * Copy the rd_pkindex and rd_replidindex value computed by
4801 * RelationGetIndexList before proceeding. This is needed because
a
4802 * relcache flush could occur inside index_open below, resetting
the
4803 * fields managed by RelationGetIndexList. (The values we're
computing
4804 * will still be valid, assuming that caller has a sufficient lock
on
4805 * the relation.)
4806 */

That's what precisely goes wrong.

6. When index_open is called, relcache invalidation generated by the first
transaction commit in CIC gets accepted and processed. As part of this
invalidation, rd_indexvalid is set to 0, which invalidates rd_indexlist too
7. But S2 is currently using index list obtained at step 2 above (which has
only old index). It goes ahead and computed rd_indexattr based on just the
old index.
8. While relcahce invalidation processing at step 6 cleared rd_indexattr
(along with rd_indexvalid), it is again set at
4906 relation->rd_indexattr = bms_copy(indexattrs);

But what gets set at step 8 is based on the old view. There is no way
rd_indexattr will be invalidated until S2 receives another relcache
invalidation. This will happen when the CIC proceeds. But until then, every
update done by S2 will generate broken HOT chains, leading to data
corruption.

I can reproduce this entire scenario using gdb sessions. This also explains
why the patch I sent earlier helps to solve the problem.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Fetter 2017-02-02 15:18:45 Re: PoC: Make it possible to disallow WHERE-less UPDATE and DELETE
Previous Message Bruce Momjian 2017-02-02 15:16:29 Re: PoC: Make it possible to disallow WHERE-less UPDATE and DELETE