Re: amcheck verification for GiST

From: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: amcheck verification for GiST
Date: 2019-03-28 13:35:07
Message-ID: 6D0F90B2-4515-477C-8BF3-31E9FE64C2E1@yandex-team.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks for looking into this!

> 27 марта 2019 г., в 22:29, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> написал(а):
>
> On 27/03/2019 11:51, Andrey Borodin wrote:
>> Hi!
>> Here's new version of GiST amcheck, which takes into account recently committed GiST VACUUM.
>> It tests that deleted pages do not contain any data.
>
> Thanks! I had a look, and refactored it quite a bit.
Cool! New scan logic is much easier to read.

> I found the way the recursion worked confusing. On each internal page, it looped through all the child nodes, checking the consistency of the downlinks. And then it looped through the children again, to recurse. This isn't performance-critical, but visiting every page twice still seems strange.
>
> In gist_check_page_keys(), if we get into the code to deal with a concurrent update, we set 'parent' to point to a tuple on a parent page, then unlock it, and continue to look at remaining tuples, using the pointer that points to an unlocked buffer.
Uh, that was a tricky bug.

> I came up with the attached, which fixes the above-mentioned things. I also replaced the check that each node has only internal or leaf children, with a different check that the tree has the same height in all branches. That catches more potential problems, and was easier to implement after the refactoring. This still needs at least a round of fixing typos and tidying up comments, but it's more straightforward now, IMHO.
>
> What have you been using to test this?
Please see attached patch with line
//if (false) // THIS LINE IS INTENTIONALLY BROKEN
which breaks GiST consistency. Uncomment it to create logically broken GiST.

Also, I've fixed buffer release and small typo (childblkno->parentblkno).

To test that it does not deadlock with inserts and vacuums I use pgbench the same way is it is used here [0] but also with
SELECT gist_index_parent_check('gist_check_idx');

Also I've added NOTICE when parent is not refound. To test this, I was removing adjust here
if (stack->parenttup && gistgetadjusted(rel, stack->parenttup, idxtuple, state))

> XXX: shouldn't we rather use gist_consistent?
No, consistency test must use scan strategy, which can be absent from opclass.
Parent tuples must be adjusted with every child. We check this simple invariant, it should cover all corner cases.

> 28 марта 2019 г., в 4:57, Peter Geoghegan <pg(at)bowt(dot)ie> написал(а):
>
>
>> In gist_check_page_keys(), if we get into the code to deal with a
>> concurrent update, we set 'parent' to point to a tuple on a parent page,
>> then unlock it, and continue to look at remaining tuples, using the
>> pointer that points to an unlocked buffer.
>
> Why not just copy the page into a local buffer? See my remarks on this below.
It already was copied, there was a bug of reusing copy from released buffer. In case, when we suspected error, which outed to be not error.

>
> You probably didn't mean to leave this "boom" error behind:
>
>> + if (GistPageIsDeleted(page))
>> + {
>> + elog(ERROR,"boom");
Oops. Sorry.

> I see that you have this check for deleted pages:
>
>> + if (PageGetMaxOffsetNumber(page) > InvalidOffsetNumber)
>> + ereport(ERROR,
>> + (errcode(ERRCODE_INDEX_CORRUPTED),
>> + errmsg("index \"%s\" has deleted page with tuples",
>> + RelationGetRelationName(rel))));
>> + }
>
> Why not have a similar check for non-deleted pages, whose maxoffset
> must be <= MaxIndexTuplesPerPage?
Done.

>
> I see various errors like this one:
>
>> + if (MAXALIGN(ItemIdGetLength(iid)) != MAXALIGN(IndexTupleSize(idxtuple)))
>> + ereport(ERROR,
>> + (errcode(ERRCODE_INDEX_CORRUPTED),
>> + errmsg("index \"%s\" has inconsistent tuple sizes",
>> + RelationGetRelationName(rel))));
>
> Can't we say which TID is involved here, so we can find the offending
> corrupt tuple afterwards? Or at least the block number? And maybe even
> the LSN of the page? I think that that kind of stuff could be added in
> a number of places.
I've added block number and offset whenever known. I do not understand point of LSN here...

>
> I see this stuff that's related to concurrent processes:
>
>> + /*
>> + * It's possible that the page was split since we looked at the parent,
>> + * so that we didn't missed the downlink of the right sibling when we
>> + * scanned the parent. If so, add the right sibling to the stack now.
>> + */
>
>> + /*
>> + * There was a discrepancy between parent and child tuples.
>> + * We need to verify it is not a result of concurrent call
>> + * of gistplacetopage(). So, lock parent and try to find downlink
>> + * for current page. It may be missing due to concurrent page
>> + * split, this is OK.
>> + */
>
> Is this really needed? Isn't the ShareLock on the index sufficient? If so, why?
There may be concurrent inserts? In GiST they can reorder items on page.

>
>> + stack->parenttup = gist_refind_parent(rel, stack->parentblk, stack->blkno, strategy);
>
> If the gistplacetopage() stuff is truly necessary, then is it okay to
> call gist_refind_parent() with the original buffer lock still held
> like this?
When we call gist_refind_parent() we hold lock for a child and lock parent.
We exclude concurrent VACUUM, thus parent cannot become a child for current child, because it has to be recycled for such coincidence.

>
> I still suspect that we should have something like palloc_btree_page()
> for GiST, so that we're always operating on a copy of the page in
> palloc()'d memory.
That's actually is what we are doing now. GistScanItem->parenttup is always a copy. But we are doing more small copies of each individual tuple, because we have to refresh this copies sometimes.

> Maybe it's worthwhile to do something clever with
> concurrently holding buffer locks, but if that's what we're doing here
> then I would also expect us to have something weaker than ShareLock as
> our relation-level heavyweight lock. And, there should be a prominent
> explanation of the theory behind it somewhere.
We definitely can run under SharedLock. And I will try to compose up all things that prevent us from using weaker levels in next message...

>
>> What have you been using to test this?
>
> pg_hexedit has full support for GiST. ;-)
For me it is easier to break GiST in it's code for tests :)

Thanks!

Best regards, Andrey Borodin.

[0]https://www.postgresql.org/message-id/flat/96ec7ebd-42b9-4df5-18a4-42181c8a5a41%40iki.fi#be331694fd0c8f2a575b3e51a4306e72

Attachment Content-Type Size
0001-GiST-verification-function-for-amcheck-v7.patch application/octet-stream 20.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2019-03-28 13:48:15 Re: basebackup checksum verification
Previous Message Andres Freund 2019-03-28 13:34:15 Re: jsonpath