Heap Only Tuples (HOT) Introduction ------------ Added in PostgreSQL 8.3, Heap Only Tuples (HOT) allow the reuse of space taken by DELETEd or obsoleted UPDATEd tuples without performing a table-wide vacuum. It does this by allowing single-page vacuuming, also called "pruning". Technical Challenges -------------------- PostgreSQL's MVCC system makes single-page vacuums (pruning) quite difficult because rows must remain after UPDATE or DELETE until all transaction snapshots that were active during the command have completed. Traditionally, VACUUM must be run at a later time which sequentially scans the table and collects information about all obsolete rows. VACUUM also removes index entries for obsolete rows. Unfortunately, VACUUM can be an expensive operation because of its full table scan and index cleanup requirement. HOT adds several features that allow space reuse on a per-heap-page basis, particularly by eliminating the problem of index cleanup necessary to reuse the space take by obsolete rows. UPDATE Chains With a Single Index Entry --------------------------------------- Without HOT, every version of a row in an UPDATE chain has its own index entry, even if all indexed columns are the same. With HOT, a new tuple placed on the same page and with all indexed columns the same does not get a new index entry and is marked with the HEAP_ONLY_TUPLE flag. (The old row is marked as HEAP_HOT_UPDATE.) This allows the space taken by UPDATEd row versions to be reused during a single-page vacuum (pruning) when they is no longer visible to any running transactions. This is possible because there is only one index entry for the entire UPDATE chain on the heap page. Internally, things are a bit more complicated: Index points to 1 ctid [1] [2] [111111111]->[2222222222] In the above diagram, the index points to ctid 1, and is marked as HEAP_HOT_UPDATE. Row versions 2 is a HOT UPDATE, meaning it has no index row pointing to it, and is marked as HEAP_HOT_UPDATE. Later, even if row 1 is no longer visible to any transaction, its ctid pointer cannot be removed by pruning because concurrent index/heap lookup activity might be happening on the page and removing it might interfere with other backends. However, the heap space for row 1 can be reused: Index points to 1 ctid [1]->[2] [2222222222] In this case the ctid pointer 1 points to ctid 2, which points to heap row version 2. If row 2 is updated to version 3, it looks like this: [Index points to 1] -------------------------------------------------------- ctid [1]->[2] [3] [2222222222]->[3333333333] The arrow from 2 to 3 is part of the UPDATE chain already present on all update rows. At some later time when no transaction can see row 2 in its snapshot, the space taken by heap row 2 _and_ its ctid can be reused during a pruning, e.g. Index points to 1 ctid [1]------>[3] [3333333333] Notice that HEAP_HOT_UPDATE row 1 now points to row 3, and row 2 is now gone. Again, this is possible because row 2 did not have an index entry. Pruning occurs when a row is UPDATEd and there is no free space on the page containing the old row. Pruning scans the entire page looking for HEAP_HOT_UPDATE and HEAP_ONLY_TUPLE rows that can be removed. Row version 4 would look like this: Index points to 1 ctid [1]------>[3] [4] [3333333333]->[4444444444] and when row 3 is no longer visible, this: Index points to 1 ctid [1]----------->[4] [4444444444] As you can see, ctid 1 has to remain, but the space taken by a ctid is small compare to a heap row. The requirements for doing a HOT update is that none of the indexed columns are changed. That is checked at execution time, comparing the binary representation of the old and new values. That means that dummy updates, like "UPDATE foo SET col1 = ?" where ? is the same as the old value, can be HOT updated. Index/Sequential Scans ---------------------- When doing an index scan, whenever we reach a non-visible tuple, we need to check if the tuple is HEAP_HOT_UPDATE. If so, we need to follow the ctid pointer until we reach a visible one, or one that has not been HOT-updated. Sequential scans (and bitmap heap scans with a lossy bitmap) do not need to pay attention to the flags. Pruning ------- Pruning is a lightweight vacuum operation that can be run on a single page, with no need to scan indexes. Pruning a page makes room on the page for future updates and shortens HOT chains to make index lookups faster. Pruning removes more than just dead HOT tuples. Other dead tuples, such as those from DELETEs and aborted transactions, are truncated and leave behind only a dead line pointer. In the illustration below, ctid 1 is dead and points to no heap row. ctid [1D] [2] [2222222222] The removed tuples are compacted away using PageRepairFragmentation, like in normal vacuum. Pruning happens when accessing a page with HOT updated tuples and with less than a certain threshold of free space. To prune, we need to take a vacuum strength lock on the buffer. If that fails, we do not prune; the theory is that you usually do get the lock, and if you do not, you can get to try again next time. It would be more logical to do the pruning in heap_update() when the page is full, but by the time we get there we have already pinned the page and have references to tuples on it, so we cannot start moving tuples around it. Also, that alone would not address the desire to keep HOT chains short to avoid the overhead of traversing long chains on index lookups. To reclaim the index-pointed (HEAP_HOT_UPDATE) tuple in a HOT chain, its line pointer is turned into a redirecting line pointer that points to the line pointer of the next tuple in the chain and its heap space is reused. When the last live tuple in an update chain becomes dead (after a DELETE or a cold update), the redirecting line pointer is marked as redirected dead. That allows us to immediately reuse the heap space (but not the line pointer itself). We have effectively implemented the "truncate dead tuples to just line pointer" idea that has been proposed and rejected before because of fear of line pointer bloat. To limit the damage in the worst case, and to keep numerous arrays as well as the bitmaps in bitmap scans reasonably sized, the maximum number of line pointers (MaxHeapTuplesPerPage) is arbitrarily capped at twice its previous maximum. VACUUM FULL ----------- To make vacuum full work, any DEAD tuples in the middle of an update chain need to be removed (see comments at the top of heap_prune_hotchain_hard() for details). Vacuum full performs a more aggressive pruning that not only removes dead tuples at the beginning of an update chain, but scans the whole chain and removes any intermediate dead tuples as well. VACUUM ------ There is little change to regular vacuum. It removes dead HOT tuples, like pruning does, and cleans up any redirected dead line pointers. In lazy vacuum, we must not freeze a tuple that is in the middle of an update chain. That can happen when a tuple has xmin > xmax; it is the same scenario that requires "hard pruning" in VACUUM FULL. Freezing such tuples will break the check that xmin and xmax matches when following the chain. It is not a problem without HOT, because the preceding tuple in the chain must be dead as well so no one will try to follow the chain, but with HOT the preceding tuple would be DEAD_CHAIN, and someone might still need to follow the chain to find the live tuple. We avoid that by just not freezing such tuples. They can be frozen eventually, when the xmax of the preceding tuple is < OldestXmin as well. Statistics ---------- XXX: How do HOT-updates affect statistics? How often do we need to run autovacuum? CREATE INDEX ------------ CREATE INDEX presents a problem for HOT updates. While the existing HOT chains all have the same index values, the new index might invalidate the HOT chain because the columns in the new index might change within HOT chains. To address this issue, regular (non-concurrent) CREATE INDEX makes the new index usable only by transactions newer than the CREATE INDEX command. This prevents transactions that can see the inconsistent HOT chains from trying to use the new index and getting incorrect results. New transactions can only see the rows visible after the index was created, hence the HOT chains are consistent for them. The new index indexes only Root tuples (tuples with current index pointers) so that our index uses the same index pointers as all other indexes on the table. However the row we want to index is actually at the _end_ of the chain, ie, the most recent tuple on the index chain. But, again, because new transactions will skip over inconsistent HOT rows, this is not a problem. (If we find any HOT-updated tuples with RECENTLY_DEAD or DELETE_IN_PROGRESS we ignore it assuming that we will also come across the _end_ of the update chain and index that instead.) Practically, we prevent old transactions from using the new index by putting our transaction id in pg_index.indcreatexid after building the index (but only if HOT chains exist in the table). Queries check whether indcreatexid is in their serializable snapshot; if it is not then the index is not available for them. This means that transactions started before the create index commits will not get the benefit of the new index even if their first scan of table is only after the index exists. However new transactions get the benefit of the new index immediately because they will always follow the HOT update chain to the end. (The old tuples with the possibly different keys will never be visible to them.) The tricky case arises with queries executed in the same transaction as CREATE INDEX. In the case of a new table created within the same transaction (such as with pg_dump), the index will be usable because there will never be any HOT update chains so the indcreatexid will never be set. Also in the case of a read-committed transaction new queries will be able to use the index. A serializable transaction building an index on an existing table with HOT updates cannot not use the index. CREATE INDEX CONCURRENTLY ------------------------- In the concurrent case we take a different approach. We create the pg_index entry immediately, before we scan the table. pg_index is marked as "not ready for inserts". Then we commit and wait for any transactions which have the table open to finish. This ensures that no _new_ HOT updates will change the key value for our new index. After waiting for transactions which had the table open, we build the index with a snapshot. Because we waited for anyone who existed before the index was created, any tuples seen in the snapshot will have only valid forward-growing HOT chains. (They might still have older HOT updates behind them which are broken.) As above, we point the index pointers at the Root of the HOT-update chain but we use the key from the end of the chain. We mark the index open for inserts (but still not ready for reads) then we again wait for transactions which have the table open. Then we take a second reference snapshot and validate the index. This searches for tuples missing from the index --- but it again has to look up the root of the HOT chains and search for those tuples in the index. Then we wait until _every_ transaction in progress in the validate_index reference snapshot is finished. This ensures that nobody is alive any longer who could possibly see any of the tuples in a broken HOT chain. Glossary -------- Broken HOT Chain A HOT chain in which the key value for an index has changed. This is not allowed to occur normally but if a new index is created it can happen. In that case various strategies are used to ensure that no transaction for which the older tuples are visible can use the index. Cold update A normal, non-HOT update. DEAD_CHAIN (HEAPTUPLE_DEAD_CHAIN) New return value for HeapTupleSatisfiesVacuum, which means that the tuple is not visible to anyone, but it has been HOT updated so we cannot remove it yet because the following tuples in the chain would become inaccessible from indexes. Heap-only tuple A heap tuple with no index pointers. Marked with HEAP_ONLY_TUPLE flag. HOT update An UPDATE where the new tuple becomes a heap-only-tuple, and no index entries are made. HOT-updated tuple An updated tuple, so that the next tuple in the chain is a heap-only tuple. Marked with HEAP_HOT_UPDATE flag. Redirecting line pointer A line pointer that points to another line pointer. lp_len is set to a magic value (ITEMID_REDIRECTED), and lp_off is the OffsetNumber of the line pointer it points to. Redirected dead line pointer A stub line pointer, that does not point to anything, but cannot be removed or reused yet because there is index pointers to it. Semantically same as a dead tuple. Root tuple The first tuple in a HOT update chain that indexes point to. Update chain A chain of updated tuples, so that each tuple's ctid points to the next tuple in the chain. A HOT update chain is an update chain that consists of a root tuple and one or more heap-only tuples. An update chain can contain both HOT and non-HOT (cold) updated tuples.