Re: Re: PANIC: invalid index offnum: 186 when processing BRIN indexes in VACUUM

From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: PANIC: invalid index offnum: 186 when processing BRIN indexes in VACUUM
Date: 2017-10-31 17:37:30
Message-ID: 20171031173730.cxl6gkye6jrfccsm@alvherre.pgsql
Lists: pgsql-hackers

Tom Lane wrote:
> So in a few more runs this morning using Alvaro's simplified test case,
> I have seen the following behaviors not previously reported:

> 1. Crashes in PageIndexTupleOverwrite, which has the same "invalid index
> offnum: %u" error report as PageIndexTupleDeleteNoCompact. I note the
> same message appears in plain PageIndexTupleDelete as well.
> I think we've been too hasty to assume all instances of this came out of
> PageIndexTupleDeleteNoCompact.

Ah, I wasn't paying close attention to the originator routine of the
message, but you're right, I see this one too.

> 2. Crashes in the data-insertion process, not only the process running
> summarize_range:

Yeah, I saw these. I was expecting them, since the two routines
(brininsert and summarize_range) pretty much share the insertion
protocol.

> I really don't understand how any of this "let's release the buffer
> lock and then take it back later" logic is supposed to work reliably.

Yeah, evidently that was way too optimistic and I'll need to figure out
a better mechanism to handle this.

The intention was to avoid deadlocks while locking the target page for
the insertion: by having both pages start unlocked we can simply lock
them in block number order. If we keep the page containing the tuple
locked, I don't see how to reliably avoid a deadlock while acquiring a
buffer to insert the new tuple.
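
As an illustration only, and not the actual BRIN code, the "lock them in
block number order" idea can be sketched as below. Everything here is a
hypothetical stand-in: DemoPage, demo_first_to_lock, demo_lock_pair and
demo_unlock_pair are made-up names, and a pthread mutex plays the role of
the buffer content lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for a shared buffer page; not a PostgreSQL type. */
typedef struct DemoPage
{
    uint32_t        blkno;  /* block number, used as the lock-ordering key */
    pthread_mutex_t lock;   /* stands in for the buffer content lock */
} DemoPage;

/* Pick the page that must be locked first: the lower block number. */
static DemoPage *
demo_first_to_lock(DemoPage *a, DemoPage *b)
{
    return (a->blkno <= b->blkno) ? a : b;
}

/*
 * Lock two pages in ascending block-number order.  Both pages must be
 * unlocked on entry; that precondition is what makes the ordering
 * possible in the first place.
 */
static void
demo_lock_pair(DemoPage *a, DemoPage *b)
{
    DemoPage *first = demo_first_to_lock(a, b);
    DemoPage *second = (first == a) ? b : a;

    /* Distinct pages are assumed to have distinct block numbers. */
    assert(a == b || a->blkno != b->blkno);

    pthread_mutex_lock(&first->lock);
    if (second != first)
        pthread_mutex_lock(&second->lock);
}

/* Release both locks; the release order does not matter for deadlocks. */
static void
demo_unlock_pair(DemoPage *a, DemoPage *b)
{
    pthread_mutex_unlock(&a->lock);
    if (b != a)
        pthread_mutex_unlock(&b->lock);
}
```

Two backends calling demo_lock_pair on the same pair of pages always
contend on the lower-numbered page first, so neither can end up holding
one lock while waiting for the other. The sketch relies on both pages
being unlocked on entry, which is exactly the property that is lost if
the page containing the tuple stays locked.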

> BTW, while I'm bitching, it seems fairly insane from a concurrency
> standpoint that brin_getinsertbuffer is calling RecordPageWithFreeSpace
> while holding at least one and possibly two buffer locks. Shouldn't
> that be done someplace else?

Hmm. I spent a lot of effort (commit ccc4c074994d) to avoid leaving
pages uninitialized / unrecorded in FSM. I left this on purpose on the
rationale that trying to fix it would make the callsites more convoluted
(the retry logic doesn't help). But as I recall this was supposed to be
done only in the rare case where the buffer could not be returned to
caller ... but that's not what the current code does, so there is
something wrong there.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
