Quick Links

Re: Index corruption with CREATE INDEX CONCURRENTLY

From:	Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Index corruption with CREATE INDEX CONCURRENTLY
Date:	2017-02-04 14:11:33
Message-ID:	CABOikdMPEe6UazAmyCwLQskA-+BbmGtAhbJZePxS64rQ-KeMOw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Sat, Feb 4, 2017 at 12:10 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:

>
>
> If we do above, then I think primary key attrs won't be returned
> because for those we are using relation copy rather than an original
> working copy of attrs. See code below:
>
> switch (attrKind)
> {
> ..
> case INDEX_ATTR_BITMAP_PRIMARY_KEY:
> return bms_copy(relation->rd_pkattr);
>
>
I don't think that would a problem since if relation_rd_indexattr is NULL,
primary key attrs will be recomputed and returned.

>
> Apart from above, I think after the proposed patch, it will be quite
> possible that in heap_update(), hot_attrs, key_attrs and id_attrs are
> computed using different index lists (consider relcache flush happens
> after computation of hot_attrs or key_attrs) which was previously not
> possible. I think in the later code we assume that all the three
> should be computed using the same indexlist. For ex. see the below
> code:
>

This seems like a real problem to me. I don't know what the consequences
are, but definitely having various attribute lists to have different view
of the set of indexes doesn't seem right.

The problem we are trying to fix is very clear by now. Relcache flush
clears rd_indexvalid and rd_indexattr together, but because of the race,
rd_indexattr is recomputed with the old information and gets cached again,
while rd_indexvalid (and rd_indexlist) remains unset. That leads to the
index corruption in CIC and may cause other issues too that we're not aware
right now. We want cached attribute lists to become invalid whenever index
list is invalidated and that's not happening right now.

So we have tried three approaches so far.

1. Simply reset rd_indexattr whenever we detect rd_indexvalid has been
reset. This was the first patch. It's a very simple patch and has worked
for me with CACHE_CLOBBER_ALWAYS and even fixes the CIC bug. The only
downside it seems that we're invalidating cached attribute lists outside
the normal relcache invalidation path. It also works on the premise that
RelationGetIndexList() will be called every time
before RelationGetIndexAttrBitmap() is called, otherwise we could still end
up using stale cached copies. I wonder if cached plans can be a problem
here.

2. In the second patch, we tried to recompute attribute lists if a relcache
flush happens in between and index list is invalidated. We've seen problems
with that, especially it getting into an infinite loop with
CACHE_CLOBBER_ALWAYS. Not clear why it happens, but apparently relcache
flushes keep happening.

3. We tried the third approach where we don't cache attribute lists if know
that index set has changed and hope that the next caller gets the correct
info. But Amit rightly identified problems with that approach too.

So we either need to find a 4th approach or stay with the first patch
unless we see a problem there too (cached plans, anything else). Or may be
fix CREATE INDEX CONCURRENTLY so that at least the first phase which add
the index entry to the catalog happens with a strong lock. This will lead
to momentary blocking of write queries, but protect us from the race. I'm
assuming this is the only code that can cause the problem, and I haven't
checked other potential hazards.

Ideas/suggestions?

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Re: Index corruption with CREATE INDEX CONCURRENTLY at 2017-02-04 06:40:14 from Amit Kapila

Responses

Re: Index corruption with CREATE INDEX CONCURRENTLY at 2017-02-05 20:14:59 from Tom Lane

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Masahiko Sawada	2017-02-04 15:43:04	Re: Vacuum: allow usage of more than 1GB of work mem
Previous Message	amul sul	2017-02-04 13:06:25	Re: Constraint exclusion failed to prune partition in case of partition expression involves function call