Heap Only Tuples (HOT)

Introduction
------------

Added in PostgreSQL 8.3, Heap Only Tuples (HOT) allow the reuse of space
taken by DELETEd or obsoleted UPDATEd tuples without performing a
table-wide vacuum.  It does this by allowing single-page vacuuming, also
called "pruning".


Technical Challenges
--------------------

PostgreSQL's MVCC system makes single-page vacuums (pruning) quite
difficult because rows must remain after UPDATE or DELETE until all
transaction snapshots that were active during the command have
completed.  Traditionally, VACUUM must be run at a later time which
sequentially scans the table and collects information about all obsolete
rows.  VACUUM also removes index entries for obsolete rows. 
Unfortunately, VACUUM can be an expensive operation because of its full
table scan and index cleanup requirement.

HOT adds several features that allow space reuse on a per-heap-page
basis, particularly by eliminating the problem of index cleanup
necessary to reuse the space take by obsolete rows.

UPDATE Chains With a Single Index Entry
---------------------------------------

Without HOT, every version of a row in an UPDATE chain has its own index
entry, even if all indexed columns are the same.  With HOT, a new tuple
placed on the same page and with all indexed columns the same does not
get a new index entry and is marked with the HEAP_ONLY_TUPLE flag.  (The
old row is marked as HEAP_HOT_UPDATE.)  This allows the space taken by
UPDATEd row versions to be reused during a single-page vacuum (pruning)
when they is no longer visible to any running transactions.  This is
possible because there is only one index entry for the entire UPDATE
chain on the heap page.

Internally, things are a bit more complicated:

	Index points to 1
	ctid [1]  [2]

	[111111111]->[2222222222]

In the above diagram, the index points to ctid 1, and is marked as
HEAP_HOT_UPDATE.  Row versions 2 is a HOT UPDATE, meaning it has no
index row pointing to it, and is marked as HEAP_HOT_UPDATE. Later, even
if row 1 is no longer visible to any transaction, its ctid pointer
cannot be removed by pruning because concurrent index/heap lookup
activity might be happening on the page and removing it might interfere
with other backends. However, the heap space for row 1 can be reused:

	Index points to 1
	ctid [1]->[2]

	[2222222222]

In this case the ctid pointer 1 points to ctid 2, which points to heap
row version 2.

If row 2 is updated to version 3, it looks like this:

	[Index points to 1]
	--------------------------------------------------------
	ctid [1]->[2]  [3]

	[2222222222]->[3333333333]

The arrow from 2 to 3 is part of the UPDATE chain already present on all
update rows.

At some later time when no transaction can see row 2 in its snapshot,
the space taken by heap row 2 _and_ its ctid can be reused during a
pruning, e.g.

	Index points to 1
	ctid [1]------>[3]

	[3333333333]

Notice that HEAP_HOT_UPDATE row 1 now points to row 3, and row 2 is now
gone.  Again, this is possible because row 2 did not have an index
entry.

Pruning occurs when a row is UPDATEd and there is no free space on the
page containing the old row.  Pruning scans the entire page looking for
HEAP_HOT_UPDATE and HEAP_ONLY_TUPLE rows that can be removed.

Row version 4 would look like this:

	Index points to 1
	ctid [1]------>[3]  [4]

	[3333333333]->[4444444444]

and when row 3 is no longer visible, this:

	Index points to 1
	ctid [1]----------->[4]

	[4444444444]

As you can see, ctid 1 has to remain, but the space taken by a ctid is
small compare to a heap row.

The requirements for doing a HOT update is that none of the indexed
columns are changed. That is checked at execution time, comparing the
binary representation of the old and new values. That means that dummy
updates, like "UPDATE foo SET col1 = ?" where ? is the same as the old
value, can be HOT updated.

Index/Sequential Scans
----------------------

When doing an index scan, whenever we reach a non-visible tuple, we need
to check if the tuple is HEAP_HOT_UPDATE.  If so, we need to follow the
ctid pointer until we reach a visible one, or one that has not been
HOT-updated.

Sequential scans (and bitmap heap scans with a lossy bitmap) do not need
to pay attention to the flags.

Pruning
-------

Pruning is a lightweight vacuum operation that can be run on a single
page, with no need to scan indexes.  Pruning a page makes room on the
page for future updates and shortens HOT chains to make index lookups
faster.

Pruning removes more than just dead HOT tuples. Other dead tuples, such
as those from DELETEs and aborted transactions, are truncated and leave
behind only a dead line pointer.  In the illustration below, ctid 1 is
dead and points to no heap row.

	ctid [1D] [2]

	[2222222222]

The removed tuples are compacted away using PageRepairFragmentation,
like in normal vacuum.

Pruning happens when accessing a page with HOT updated tuples and with
less than a certain threshold of free space.  To prune, we need to take
a vacuum strength lock on the buffer. If that fails, we do not prune;
the theory is that you usually do get the lock, and if you do not, you
can get to try again next time. It would be more logical to do the
pruning in heap_update() when the page is full, but by the time we get
there we have already pinned the page and have references to tuples on
it, so we cannot start moving tuples around it. Also, that alone would
not address the desire to keep HOT chains short to avoid the overhead of
traversing long chains on index lookups.

To reclaim the index-pointed (HEAP_HOT_UPDATE) tuple in a HOT chain, its
line pointer is turned into a redirecting line pointer that points to
the line pointer of the next tuple in the chain and its heap space is
reused.

When the last live tuple in an update chain becomes dead (after a DELETE
or a cold update), the redirecting line pointer is marked as redirected
dead. That allows us to immediately reuse the heap space (but not the
line pointer itself).  

We have effectively implemented the "truncate dead tuples to just line
pointer" idea that has been proposed and rejected before because of fear
of line pointer bloat. To limit the damage in the worst case, and to
keep numerous arrays as well as the bitmaps in bitmap scans reasonably
sized, the maximum number of line pointers (MaxHeapTuplesPerPage) is
arbitrarily capped at twice its previous maximum.

VACUUM FULL
-----------

To make vacuum full work, any DEAD tuples in the middle of an update
chain need to be removed (see comments at the top of
heap_prune_hotchain_hard() for details). Vacuum full performs a more
aggressive pruning that not only removes dead tuples at the beginning of
an update chain, but scans the whole chain and removes any intermediate
dead tuples as well.

VACUUM
------

There is little change to regular vacuum. It removes dead HOT tuples,
like pruning does, and cleans up any redirected dead line pointers.

In lazy vacuum, we must not freeze a tuple that is in the middle of an
update chain. That can happen when a tuple has xmin > xmax; it is the
same scenario that requires "hard pruning" in VACUUM FULL. Freezing such
tuples will break the check that xmin and xmax matches when following
the chain. It is not a problem without HOT, because the preceding tuple
in the chain must be dead as well so no one will try to follow the
chain, but with HOT the preceding tuple would be DEAD_CHAIN, and someone
might still need to follow the chain to find the live tuple. We avoid
that by just not freezing such tuples. They can be frozen eventually,
when the xmax of the preceding tuple is < OldestXmin as well.

Statistics
----------

XXX: How do HOT-updates affect statistics? How often do we need to run
autovacuum?

CREATE INDEX
------------

CREATE INDEX presents a problem for HOT updates.  While the existing HOT
chains all have the same index values, the new index might invalidate
the HOT chain because the columns in the new index might change within
HOT chains.

To address this issue, regular (non-concurrent) CREATE INDEX makes the
new index usable only by transactions newer than the CREATE INDEX
command.  This prevents transactions that can see the inconsistent HOT
chains from trying to use the new index and getting incorrect results.  
New transactions can only see the rows visible after the index was
created, hence the HOT chains are consistent for them.

The new index indexes only Root tuples (tuples with current index
pointers) so that our index uses the same index pointers as all other
indexes on the table. However the row we want to index is actually at
the _end_ of the chain, ie, the most recent tuple on the index chain. 
But, again, because new transactions will skip over inconsistent HOT
rows, this is not a problem.

(If we find any HOT-updated tuples with RECENTLY_DEAD or
DELETE_IN_PROGRESS we ignore it assuming that we will also come across
the _end_ of the update chain and index that instead.)

Practically, we prevent old transactions from using the new index by
putting our transaction id in pg_index.indcreatexid after building the
index (but only if HOT chains exist in the table). Queries check whether
indcreatexid is in their serializable snapshot;  if it is not then the
index is not available for them.

This means that transactions started before the create index commits
will not get the benefit of the new index even if their first scan of
table is only after the index exists. However new transactions get the
benefit of the new index immediately because they will always follow the
HOT update chain to the end.  (The old tuples with the possibly
different keys will never be visible to them.)

The tricky case arises with queries executed in the same transaction as
CREATE INDEX. In the case of a new table created within the same
transaction (such as with pg_dump), the index will be usable because
there will never be any HOT update chains so the indcreatexid will never
be set. Also in the case of a read-committed transaction new queries
will be able to use the index. A serializable transaction building an
index on an existing table with HOT updates cannot not use the index.


CREATE INDEX CONCURRENTLY
-------------------------

In the concurrent case we take a different approach.  We create the
pg_index entry immediately, before we scan the table. pg_index is marked
as "not ready for inserts". Then we commit and wait for any transactions
which have the table open to finish. This ensures that no _new_ HOT
updates will change the key value for our new index.

After waiting for transactions which had the table open, we build the
index with a snapshot. Because we waited for anyone who existed before
the index was created, any tuples seen in the snapshot will have only
valid forward-growing HOT chains. (They might still have older HOT
updates behind them which are broken.)  As above, we point the index
pointers at the Root of the HOT-update chain but we use the key from the
end of the chain.

We mark the index open for inserts (but still not ready for reads) then
we again wait for transactions which have the table open. Then we take a
second reference snapshot and validate the index. This searches for
tuples missing from the index --- but it again has to look up the root
of the HOT chains and search for those tuples in the index.

Then we wait until _every_ transaction in progress in the validate_index
reference snapshot is finished. This ensures that nobody is alive any
longer who could possibly see any of the tuples in a broken HOT chain.

Glossary
--------

Broken HOT Chain

	A HOT chain in which the key value for an index has changed.

	This is not allowed to occur normally but if a new index is created
	it can happen. In that case various strategies are used to ensure
	that no transaction for which the older tuples are visible can
	use the index.

Cold update
	A normal, non-HOT update.

DEAD_CHAIN (HEAPTUPLE_DEAD_CHAIN)

	New return value for HeapTupleSatisfiesVacuum, which means that
	the tuple is not visible to anyone, but it has been HOT updated
	so we cannot remove it yet because the following tuples in the
	chain would become inaccessible from indexes.

Heap-only tuple
	A heap tuple with no index pointers. Marked with HEAP_ONLY_TUPLE flag.

HOT update

	An UPDATE where the new tuple becomes a heap-only-tuple, and no
	index entries are made.

HOT-updated tuple

	An updated tuple, so that the next tuple in the chain is a
	heap-only tuple. Marked with HEAP_HOT_UPDATE flag.

Redirecting line pointer

	A line pointer that points to another line pointer. lp_len is set
	to a magic value (ITEMID_REDIRECTED), and lp_off is the
	OffsetNumber of the line pointer it points to.

Redirected dead line pointer

	A stub line pointer, that does not point to anything, but cannot
	be removed or reused yet because there is index pointers to it.
	Semantically same as a dead tuple.

Root tuple
	The first tuple in a HOT update chain that indexes point to.

Update chain

	A chain of updated tuples, so that each tuple's ctid points to
	the next tuple in the chain. A HOT update chain is an update chain
	that consists of a root tuple and one or more heap-only tuples.
	An update chain can contain both HOT and non-HOT (cold) updated
	tuples.