Re: Reproducible GIST index corruption under concurrent updates

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Duncan Sands <duncan(dot)sands(at)deepbluecap(dot)com>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: Reproducible GIST index corruption under concurrent updates
Date: 2021-01-19 22:31:30
Message-ID: 907491b3-87c7-aadd-3fb1-be3c542b7fb7@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On 19/01/2021 23:53, Heikki Linnakangas wrote:
> On 19/01/2021 19:12, Andrey Borodin wrote:
>>> 19 янв. 2021 г., в 16:15, Duncan Sands <duncan(dot)sands(at)deepbluecap(dot)com> написал(а):
>>>
>>> Enjoy!
>>
>> Thanks!
>> It reproduces on my machine. Seems like all tuples are indexed, but some root downlinks are not adjusted properly. Probably, after concurrent split.
>
> This is reproducable down to v12. I bisected it to this commit:
>
> commit e2e992c93145cfc0e3563fb84efd25b390a84bb9 (refs/bisect/bad)
> Author: Heikki Linnakangas <heikki(dot)linnakangas(at)iki(dot)fi>
> Date: Wed Jul 24 20:24:05 2019 +0300
>
> Refactor checks for deleted GiST pages.
>
> The explicit check in gistScanPage() isn't currently really
> necessary, as
> a deleted page is always empty, so the loop would fall through without
> doing anything, anyway. But it's a marginal optimization, and it
> gives a
> nice place to attach a comment to explain how it works.
>
> Backpatch to v12, where GiST page deletion was introduced.
>
> Reviewed-by: Andrey Borodin
> Discussion:
> https://www.postgresql.org/message-id/835A15A5-F1B4-4446-A711-BF48357EB602%40yandex-team.ru
>
> I'll dig deeper tomorrow.

The culprit is this change:

> @@ -858,12 +856,13 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
> * leaf/inner is enough to recognize split for root
> */
> }
> - else if (GistFollowRight(stack->page) ||
> - stack->parent->lsn < GistPageGetNSN(stack->page))
> + else if ((GistFollowRight(stack->page) ||
> + stack->parent->lsn < GistPageGetNSN(stack->page)) &&
> + GistPageIsDeleted(stack->page))
> {
> /*
> - * The page was split while we momentarily unlocked the
> - * page. Go back to parent.
> + * The page was split or deleted while we momentarily
> + * unlocked the page. Go back to parent.
> */
> UnlockReleaseBuffer(stack->buffer);
> xlocked = false;

The comment change was correct, but the condition used &&. Should've
been ||. There is another copy of basically the same condition earlier
in the function that was changed correctly, but I blundered this one. Oops.

The attached patch fixes this. I also added an assertion to the
gistplacetopage() function, to check that we never try to insert on a
deleted page. This bug could've made that happen too, although in this
case the problem was a concurrent split, not a deletion. I'll backpatch
and push this tomorrow.

Many thanks for the easy reproducer script, Duncan!

- Heikki

Attachment Content-Type Size
fix-gist-corruption-bug.patch text/x-patch 807 bytes

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Simon Riggs 2021-01-20 06:58:09 Bug in error reporting for multi-line JSON
Previous Message Heikki Linnakangas 2021-01-19 21:53:56 Re: Reproducible GIST index corruption under concurrent updates