From: | Tomas Vondra <tomas(at)vondra(dot)me> |
---|---|
To: | Arseniy Mukhin <arseniy(dot)mukhin(dot)dev(at)gmail(dot)com> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kirill Reshke <reshkekirill(at)gmail(dot)com>, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Alexander Lakhin <exclusion(at)gmail(dot)com>, Andrey Borodin <amborodin86(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Jose Arthur Benetasso Villanova <jose(dot)arthur(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Nikolay Samokhvalov <samokhvalov(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Amcheck verification of GiST and GIN |
Date: | 2025-05-09 14:22:44 |
Message-ID: | ec5a0c8c-1a41-4a81-a54e-e46a200689be@vondra.me |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 5/9/25 14:43, Arseniy Mukhin wrote:
> Hello,
>
> Thanks everybody for the patch.
>
> I noticed there are no tests that GIN check fails if the index is
> corrupted, so I thought it would be great to have some.
> While writing tests I noticed some issues in the patch (all issues are
> for verify_gin.c)
>
> 1) When we add new items to the entry tree stack, ptr->parenttup is always null
> because GinPageGetOpaque(page)->rightlink is never NULL.
>
> /* last tuple in layer has no high key */
> if (i != maxoff && !GinPageGetOpaque(page)->rightlink)
> ptr->parenttup = CopyIndexTuple(idxtuple);
> else
> ptr->parenttup = NULL;
>
> Right way to check if entry doesn't have right neighbour is
>
> GinPageGetOpaque(page)->rightlink == InvalidBlockNumber
>
> But even if we fix it, the condition would not do what the comment
> says. If we want to have NULL as parenttup
> only for the last tuple of the btree layer, the right check would be:
>
> if (i == maxoff && rightlink == InvalidBlockNumber)
> ptr->parenttup = NULL;
> else
> ptr->parenttup = CopyIndexTuple(idxtuple);
>
> 2) Check don't use attnum in comparisons, but for multicolumn indexes
> attnum defines order. When we compare max entry page key
> with parent key we ignore attnum. It means we occasionally can try to
> compare keys of different columns.
> While checking order within the same page we skip checking order for
> tuples with different attnums now,
> but it can be checked. Fix is easy: using ginCompareAttEntries()
> instead of ginCompareEntries().
>
>
> 3) Here a code of the split detection
>
> if (rightlink != InvalidBlockNumber &&
> ginCompareEntries(&state, attnum, page_max_key,
> page_max_key_category, parent_key,
> parent_key_category) > 0)
> {
> /* split page detected, install right link to the stack */
>
> Condition seems not right, because the child page max item key never
> can be bigger then parent key.
> It can be equal to the parentkey, and it means that there was no split
> and the parent key that we cached in the stack is still
> relevant. Or it could be less then cached parent key and it means that
> split took place and old max item key moved to the
> right neighbour and current page max item key should be less then
> cached parent key. So I think we should replace > with <.
>
> 4) Here is the code for checking the order within the entry page.
>
> /*
> * First block is metadata, skip order check. Also, never check
> * for high key on rightmost page, as this key is not really
> * stored explicitly.
> *
> * Also make sure to not compare entries for different attnums,
> * which may be stored on the same page.
> */
> if (i != FirstOffsetNumber && attnum == prev_attnum &&
> stack->blkno != GIN_ROOT_BLKNO &&
> !(i == maxoff && rightlink == InvalidBlockNumber))
> {
> prev_key = gintuple_get_key(&state, prev_tuple,
> &prev_key_category);
> if (ginCompareEntries(&state, attnum, prev_key,
> prev_key_category, current_key,
> current_key_category) >= 0)
>
> We skip checking the order for the root page, it's not clear why.
> Probably there is some mess with the meta page, because
> comment says "First block is metadata, skip order check". So I think
> we can remove
>
> stack->blkno != GIN_ROOT_BLKNO
>
> 5) The same place as 4). We skip checking the order for the high key
> on the rightmost page, as this key is not really stored explicitly,
> but for leaf pages all keys are stored explicitly, so we can check the
> order for the last item of the leaf page too.
> So I think we can change the condition to this:
>
> !(i == maxoff && rightlink == InvalidBlockNumber &&
> !GinPageIsLeaf(page))
>
> 6) In posting tree parent key check part:
>
> /*
> * Check if this tuple is consistent with the downlink in the
> * parent.
> */
> if (stack->parentblk != InvalidBlockNumber && i == maxoff &&
> ItemPointerCompare(&stack->parentkey, &posting_item->key) < 0)
> ereport(ERROR,
> ...
> Here we don't check if stack->parentkey is valid, so sometimes we
> compare invalid parentkey (because we can have
> valid parentblk and invalid parentkey the same time). Invalid
> parentkey is always bigger, so the code never triggers
> ereport, but it doesn't look right. so probably we can rewrite it this way:
>
> if (i == maxoff && ItemPointerIsValid(&stack->parentkey) &&
> ItemPointerCompare(&stack->parentkey, &posting_item->key) < 0)
>
> 7) When producing stack entries for posting tree check, we set parent
> key like this:
>
> /*
> * Set rightmost parent key to invalid item pointer. Its value
> * is 'Infinity' and not explicitly stored.
> */
> if (rightlink == InvalidBlockNumber)
> ItemPointerSetInvalid(&ptr->parentkey);
> else
> ptr->parentkey = posting_item->key;
>
> We set invalid parent key for all items of the rightmost page. But
> it's the only rightmost item that doesn't have an explicit
> parentkey (actually the comment says exactly this, but the code does a
> different thing). All others have an explicit parent
> key and we can set it. So fix can look like this:
>
> if (rightlink == InvalidBlockNumber && i == maxoff)
> ItemPointerSetInvalid(&ptr->parentkey);
> else
> ptr->parentkey = posting_item->key;
>
> But for (rightlink == InvalidBlockNumber && i == maxoff)
> posting_item->key is always (0,0) (we check it a little bit earlier),
> so I think we can simplify it:
>
> ptr->parentkey = posting_item->key;
>
> 8) In the function gin_refind_parent() the code
>
> if (ItemPointerGetBlockNumber(&(itup->t_tid)) == childblkno)
>
> triggers Assert(ItemPointerIsValid(pointer)) within the
> ItemPointerGetBlockNumber(), because itup->t_tid could be invalid.
> AFAIS GIN code uses a special method GinGetDownlink(itup) that avoids
> this Assert. So we can use it here too.
>
> Please find the attached patch with fixes for issues above. Also there
> are added regression test for multicolumn index,
> and several tap tests with some basic corrupted index cases. I'm not
> sure if it's the right way to write such tests and would
> be glad to hear any feedback, especially about
> invalid_entry_columns_order_test() where it seems important to
> preserve
> byte ordering. Also all tests expect standard page size 8192 now.
>
>
> Also there are several points that I think also worth addressing:
>
> 9) Field 'parentlsn' is set but never actually used for any check. Or
> I missed something.
>
> 10) README says "Vacuum never deletes tuples or pages from the entry
> tree." But check assumes that it's possible to have
> deleted leaf page with 0 entries.
>
> if (GinPageIsDeleted(page))
> {
> if (!GinPageIsLeaf(page))
> ereport(ERROR,
> (errcode(ERRCODE_INDEX_CORRUPTED),
> errmsg("index \"%s\" has deleted internal page %u",
> RelationGetRelationName(rel), blockNo)));
> if (PageGetMaxOffsetNumber(page) > InvalidOffsetNumber)
> ereport(ERROR,
> (errcode(ERRCODE_INDEX_CORRUPTED),
> errmsg("index \"%s\" has deleted page %u with tuples",
> RelationGetRelationName(rel), blockNo)));
> }
> 11) When we compare entry tree max page key with parent key:
>
> if (ginCompareAttEntries(&state, attnum, current_key,
> current_key_category, parent_key_attnum,
> parent_key, parent_key_category) > 0)
> {
> /*
> * There was a discrepancy between parent and child
> * tuples. We need to verify it is not a result of
> * concurrent call of gistplacetopage(). So, lock parent
> * and try to find downlink for current page. It may be
> * missing due to concurrent page split, this is OK.
> */
> pfree(stack->parenttup);
> stack->parenttup = gin_refind_parent(rel, stack->parentblk,
> stack->blkno, strategy);
>
> I think we can remove gin_refind_parent() and do ereport right away here.
> The same logic as with 3). AFAIK it's impossible to have a child item
> with a key that is higher than the cached parent key.
> Parent key bounds what keys we can insert into the child page, so it
> seems there is no way how they can appear there.
>
These look like good points. I've added it to open items so that we
don't forget about this, I won't have time to look at this until after
pgconf.dev.
thanks
--
Tomas Vondra
From | Date | Subject | |
---|---|---|---|
Next Message | Peter Geoghegan | 2025-05-09 14:22:57 | Re: Adding skip scan (including MDAM style range skip scan) to nbtree |
Previous Message | Peter Geoghegan | 2025-05-09 14:17:39 | Re: Adding skip scan (including MDAM style range skip scan) to nbtree |