Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Peter Geoghegan <pg(at)bowt(dot)ie>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Pawel Kudzia <kudzia(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
Date: 2021-07-17 00:30:03
Message-ID: 1871889b-f5ef-f4d4-8d42-65a9e863e0a9@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On 17/07/2021 02:40, Peter Geoghegan wrote:
> On Thu, Jul 15, 2021 at 3:56 AM Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
>> I still think this is worth checking. Despite the pending list wasn't
>> involved in the index scan with wrong results, the bug could be
>> related to insertion to the pending list. Or it could be related to
>> moving entries from the pending list to the posting list/tree.
>
> If I had to guess I'd say that the chances of the pending list being
> involved are high.
>
> shiftList() deletes pages in the pending list -- this is called from
> ginInsertCleanup(). But shiftList() doesn't call GinPageSetDeleteXid()
> to set an XID that represents when the page is safe to recycle, which
> is what ginDeletePage() always does. Why is that okay?
>
> Note that we usually use read/share buffer locks when lock-coupling
> inside ginInsertCleanup() -- so AFAICT we won't block-out concurrent
> readers with a link that's about to go stale due to recycling of the
> page. This looks unsafe. Of course it's very hard to tell what might
> be going on, since none of this seems to be explained anywhere.
>
> Even ginDeletePage() didn't do the right thing with XIDs until bugfix
> commit 52ac6cd2d0. That commit didn't touch any pending list code --
> it should of at least explained why ginInsertCleanup()/shiftList()
> don't need to use the GinPageSetDeleteXid() stuff.

Hmm, seems we should fix that. But could a prematurely recycled deleted
page cause permanent corruption?

Here's what I've found out so far, looking at the traces Pawel sent:

> 2021-07-16 07:01:19 UTC LOG: ReadBuffer 1663/16390/16526 read gin blk 0 (ginutil.c:636 ginGetStats)
> 2021-07-16 07:01:19 UTC LOG: ReadBuffer 1663/16390/16526 read gin blk 0 (ginget.c:1866 scanPendingInsert)
> 2021-07-16 07:01:19 UTC LOG: ReadBuffer 1663/16390/16526 read gin blk 383295 (ginget.c:1894 scanPendingInsert)
> 2021-07-16 07:01:19 UTC LOG: ReadBuffer 1663/16390/16526 read gin blk 383298 (ginget.c:1518 scanGetCandidate)
> 2021-07-16 07:01:19 UTC LOG: ReadBuffer 1663/16390/16526 read gin blk 2 (ginget.c:1518 scanGetCandidate)
> 2021-07-16 07:01:19 UTC LOG: ReadBuffer 1663/16390/16526 read gin blk 3 (ginget.c:1518 scanGetCandidate)
> [few hundred calls from scanGetCandidate omitted]
> 2021-07-16 07:01:19 UTC LOG: ReadBuffer 1663/16390/16526 read gin blk 200586 (ginget.c:1518 scanGetCandidate)
> 2021-07-16 07:01:19 UTC LOG: ReadBuffer 1663/16390/16526 read gin blk 200854 (ginget.c:1518 scanGetCandidate)
> 2021-07-16 07:01:19 UTC LOG: ReadBuffer 1663/16390/16526 read gin blk 1 (ginbtree.c:89 ginFindLeafPage)
> 2021-07-16 07:01:19 UTC LOG: ReadBuffer 1663/16390/16526 read gin blk 1452 (ginbtree.c:89 ginFindLeafPage)
> 2021-07-16 07:01:19 UTC LOG: GINBUG: startScanKey called: excludeOnly 0 nentries 1
> 2021-07-16 07:01:19 UTC LOG: GINBUG: called triConsistentFn(1): 1 0
> 2021-07-16 07:01:19 UTC LOG: GINBUG: called triConsistentFn(1): 1 0
> 2021-07-16 07:01:19 UTC LOG: GINBUG: called triConsistentFn(1): 1 0
> ...

So, it begins by reading the metadata page, and then it scans the
pending list. There are a few hundred pages in the pending list, but no
matches are found there.

Then, it reads the root page at blk 1. And then, it reads page 1452.

I used pg_filedump on the root page (blk 1):

> Block 1 ********************************************************
> <Header> -----
> Block Offset: 0x00002000 Offsets: Lower 1376 (0x0560)
> Block: Size 8192 Version 4 Upper 2776 (0x0ad8)
> LSN: logid 410 recoff 0x04c6b6c0 Special 8184 (0x1ff8)
> Items: 338 Free Space: 1400
> Checksum: 0xde29 Prune XID: 0x00000000 Flags: 0x0000 ()
> Length (including item array): 1376
>
> <Data> -----
> Item 1 -- Length: 16 Offset: 7736 (0x1e38) Flags: NORMAL
> Item 2 -- Length: 16 Offset: 8168 (0x1fe8) Flags: NORMAL
> Item 3 -- Length: 16 Offset: 7800 (0x1e78) Flags: NORMAL
> ...
> Item 337 -- Length: 16 Offset: 7032 (0x1b78) Flags: NORMAL
> Item 338 -- Length: 16 Offset: 8152 (0x1fd8) Flags: NORMAL
>
> <Special Section> -----
> GIN Index Section:
> Flags: 0x00000000 () Maxoff: 0
> Blocks: RightLink (-1)

It has no Flags set, which means that it is an internal entry tree page.
The items on an internal entry tree page should point to other entry
tree pages. Now let's take a look at the next page that it read, blk 1452:

> Block 1452 ********************************************************
> <Header> -----
> Block Offset: 0x00b58000 Offsets: Lower 562 (0x0232)
> Block: Size 8192 Version 4 Upper 8184 (0x1ff8)
> LSN: logid 471 recoff 0xf0cf8a68 Special 8184 (0x1ff8)
> Items: 134 Free Space: 7622
> Checksum: 0x1628 Prune XID: 0x00000000 Flags: 0x0000 ()
> Length (including item array): 560
>
> <Data> -----
> Item 1 -- Length: 0 Offset: 0 (0x0000) Flags: UNUSED
> Item 2 -- Length: 0 Offset: 0 (0x0000) Flags: UNUSED
> Item 3 -- Length: 7780 Offset: 0 (0x0000) Flags: UNUSED
> Item 4 -- Length: 5990 Offset: 4 (0x0004) Flags: UNUSED
> Item 5 -- Length: 0 Offset: 6 (0x0006) Flags: UNUSED
> Item 6 -- Length: 2 Offset: 15559 (0x3cc7) Flags: UNUSED
> Error: Item contents extend beyond block.
> BlockSize<8192> Bytes Read<8192> Item Start<15561>.
> Item 7 -- Length: 3 Offset: 15372 (0x3c0c) Flags: REDIRECT
> ...
> Item 133 -- Length: 20913 Offset: 3 (0x0003) Flags: UNUSED
> Error: Item contents extend beyond block.
> BlockSize<8192> Bytes Read<8192> Item Start<20916>.
> Item 134 -- Length: 0 Offset: 0 (0x0000) Flags: UNUSED
>
> <Special Section> -----
> GIN Index Section:
> Flags: 0x00000001 (DATA) Maxoff: 53
> Blocks: RightLink (-1)

On this page, the DATA flag is set, so it is an internal *posting* tree
page.

That's weird: the scan walked straight from an internal entry tree page
(root, at blk 1) into an internal posting tree page (blk 1452). That
doesn't make sense to me.

The next ReadBuffer call is this:

> 2021-07-16 07:01:19 UTC LOG: ReadBuffer 1663/16390/16526 read gin blk 15559 (ginbtree.c:183 ginStepRight)

Where did block 15559 come from? How come we're stepping right to it?
It's not the right sibling of the previously accessed page, 1452. In
fact, 15559 is a leaf posting tree page. I don't understand how that
sequence of page reads could happen.

After that, the log shows that it follows the right-links from block
15559 onward. The incorrectly returned entries are on posting tree page
255179. The scan reaches that page by following that chain of right-links.

I'm going to continue investigating this on Monday. I'll take a closer
look at the index entries on those blocks to see if I can make sense of it.

- Heikki

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Peter Geoghegan 2021-07-17 01:56:47 Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
Previous Message Japin Li 2021-07-17 00:11:28 Re: BUG #17111: Database created, cannot be created, but reported as inexist