Duplicate Item Pointers in Gin index

From: "R, Siva" <sivasubr(at)amazon(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Duplicate Item Pointers in Gin index
Date: 2018-02-20 23:30:35
Message-ID: 1519169435557.98185@amazon.com
Lists: pgsql-hackers

Hi,

We are currently investigating an issue with a Gin index containing duplicate item pointers for one of its keys. While we are trying to reproduce this issue, we wanted to reach out to the community to validate our current working theory.

Background:
The discussion in the following thread:

https://www.postgresql.org/message-id/flat/CAD21AoBLUSyiYKnTYtSAbC%2BF%3DXDjiaBrOUEGK%2BzUXdQ8owfPKw%40mail.gmail.com

is about vacuum skipping cleanup of the gin pending list when a user backend has already locked the pending list for cleanup.

One of the responses, from Jeff Janes, describes the source of a corruption. Here is the sequence of events leading up to it:

1. Consider a tuple that is inserted, then updated/deleted, and then passes the dead-to-all horizon. At this point, the old tuple id (corresponding to the insert) is eligible for vacuum.
2. Due to the above bug, vacuum can skip cleanup of the pending list if a user backend already has a lock on it, before proceeding to vacuum the gin posting tree.
3. At this point, it is possible that the user backend flushes the pending list containing the dead tuple to the main gin index structure after vacuum has already passed over. Vacuum has marked that tuple id slot as re-usable when there is actually an index tuple pointing to it.

Given the above, our question is whether the following sequence of events is a valid way to end up with duplicate tids in the index tree structure, on builds without the fix for the above bug (and with "Assert" disabled, as in non-debug builds):

1. Assume that in the above sequence, steps 1 and 2 have completed.
2. Before the pending list flush from user backend has happened, vacuum has marked that tuple id as re-usable.
3. User backend (let us call this backend 1) has released the lock on the metapage and is processing the pending list pages, adding the items to the build accumulator. Assume that there are currently 2 pages in the pending list, that metadata->head is currently being processed (metadata->tail is not locked yet), and that the old tuple id that was deleted is added to the accumulator.
4. Another user backend (let us call this backend 2) uses the tid that was marked re-usable for a row that contains the same key datum as before, takes metapage lock and inserts into the pending list (metadata->tail) and returns (assume it does not do cleanup).
5. The first backend (backend 1) has finished processing metadata->head, decides not to flush, and is now processing the metadata->tail page, which also contains the old tuple id, and adds it to the build accumulator (ginCombineData appends to the end of the list in the accumulator tree and checks for duplicates only through an Assert [1]). So now the build accumulator contains duplicates for that key datum.
6. First backend has finished processing metadata->tail, finds it is the last page of the pending list, and then decides to flush (one of the conditions to start flushing is that we are at the end of the pending list).
7. Each list of item pointers is obtained through ginGetBAEntry [2] and inserted into the main tree. ginGetBAEntry sorts the items, if not already sorted, using qsort with the qsortCompareItemPointers method, where the check for duplicates again happens only through an Assert [3].
8. During ginEntryInsert when we actually place the data to the page (ginEntryInsert -> ginInsertItemPointers -> ginInsertValue -> ginPlaceToPage -> beginPlaceToPage [dataBeginPlaceToPage] -> dataBeginPlaceToPageLeaf -> addItemsToLeaf). Assume that this insertion does not trigger a split for simplicity.
9. Function addItemsToLeaf calls ginMergeItemPointers to merge the item pointers in the input with the existing ones in the leaf page. The comment on ginMergeItemPointers mentions that it eliminates duplicates [4], but it eliminates duplicates only across the 2 input arrays, not within an input array (example: ginMergeItemPointers({1, 2, 2, 3}, {3, 4, 5}) will yield {1, 2, 2, 3, 4, 5}).

While this is probably a rare scenario, we were wondering if it is possible and, if so, what consequences it can have in other parts of the code. One place where we found that the strictly increasing order of item pointers is validated is ginCompressPostingList, while encoding the item pointers (through an Assert [5]; note that if there are duplicates, the encoding is not broken).

Best
Siva

Links -
[1] - https://github.com/postgres/postgres/blob/REL9_6_STABLE/src/backend/access/gin/ginbulk.c#L60
[2] - https://github.com/postgres/postgres/blob/REL9_6_STABLE/src/backend/access/gin/ginfast.c#L878
[3] - https://github.com/postgres/postgres/blob/REL9_6_STABLE/src/backend/access/gin/ginbulk.c#L250
[4] - https://github.com/postgres/postgres/blob/REL9_6_STABLE/src/backend/access/gin/ginpostinglist.c#L361
[5] - https://github.com/postgres/postgres/blob/REL9_6_STABLE/src/backend/access/gin/ginpostinglist.c#L213
