Heap WARM Tuples - Design Draft

From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Heap WARM Tuples - Design Draft
Date: 2016-08-04 10:59:09
Message-ID: CABOikdMop5Rb_RnS2xFdAXMZGSqcJ-P-BY2ruMd+buUkJ4iDPw@mail.gmail.com

Write Amplification Reduction Method (WARM)
====================================

A few years back, we developed HOT to address the problems associated with
MVCC and frequent updates, and it has served us very well. But in light of
Uber's recent technical blog highlighting some of the problems that still
remain, especially for workloads where HOT does not help to the full
extent, Simon, myself and others at 2ndQuadrant discussed several ideas,
and what I present below is an outcome of that. This is not to take away
credit from anybody else. Others may have thought about similar ideas, but
I haven't seen any concrete proposal so far.

At present, WARM looks to be a better idea than LITE, though we have that
as a backstop also (says Simon).

Background
==========

As a background, HOT primarily helps in two ways.

1. It avoids additional index entries when a tuple is updated in a way such
that no index columns are changed. Before we implemented HOT, every update
would generate a new index entry in every index on the table. This not only
resulted in bigger and more bloated indexes, but also made vacuuming
difficult because heap tuples can't be removed without first removing the
corresponding index entries. HOT made it possible to avoid additional
index entries whenever possible and ensured that most dead heap tuples can
be removed at the page level.

2. HOT also helped other cases such as DELETEs and non-HOT updates by
reducing such heap tuples to a DEAD line pointer stub until a regular
vacuum happens. Meanwhile, the space consumed by the heap tuple itself is
freed and returned to the FSM for future utilization.

Given that HOT was going to change the very core of MVCC and how tuples are
stored and accessed, we made a decision to keep things simple, and two
important conditions were spelled out for doing a HOT update.

1. There must be enough space in the same page to store the new version of
the tuple.
2. None of the index columns must be updated.

The first condition is generally easy to satisfy, either by leaving some
free space in heap pages or because the system anyway comes to a stable
state after the initial few updates. The second condition is somewhat
harder to control. Those who are aware will carefully design their schema
and choose indexes so that the HOT conditions are met. But that may not be
possible for every workload, as seen from the Uber example, and many others
may have faced similar situations.

This proposal is to relax the second condition, so it’s not quite HOT, just
WARM.

Basic Idea
========

The basic idea is to allow updates that change an index column without
requiring a new index entry in every index on the table. We still require
the first HOT condition, i.e. the heap page must have sufficient free space
to store the new version of the tuple.

Indexes whose values do not change do not require new index pointers. Only
the index whose key is being changed will need a new index entry. The new
index entry will be set to the CTID of the root line pointer.
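
To make that concrete, here is a minimal, self-contained sketch of the
decision the update path would have to make (plain C, not actual PostgreSQL
code; names such as warm_update_path and IndexInfoSketch are made up purely
for illustration):

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical summary of one index on the table. */
typedef struct IndexInfoSketch
{
    const char *name;
    bool        key_changed;    /* did the UPDATE modify this index's key columns? */
} IndexInfoSketch;

/*
 * Sketch of the decision the update path would make.  Returns a string
 * describing the chosen path; in the real system this would drive
 * heap_update() and the index insert calls.
 */
static const char *
warm_update_path(bool page_has_free_space, IndexInfoSketch *indexes, int nindexes)
{
    bool any_key_changed = false;

    for (int i = 0; i < nindexes; i++)
        if (indexes[i].key_changed)
            any_key_changed = true;

    if (!page_has_free_space)
        return "non-HOT update: new index entries in ALL indexes";

    if (!any_key_changed)
        return "HOT update: no new index entries at all";

    /*
     * WARM: only the indexes whose keys changed get a new entry, and that
     * entry points at the ROOT line pointer of the existing HOT chain.
     */
    return "WARM update: new entries only in changed indexes, pointing to root CTID";
}

int
main(void)
{
    IndexInfoSketch idx[] = {
        {"testindx_col1", true},
        {"testindx_col2", false},
        {"testindx_col3", false},
    };

    printf("%s\n", warm_update_path(true, idx, 3));
    return 0;
}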

This method succeeds in reducing the write amplification, but causes other
issues which also need to be solved. WARM breaks the invariant that all
tuples in a HOT chain have the same index values and so an IndexScan would
need to re-check the index scan conditions against the visible tuple
returned from heap_hot_search(). We must still check visibility, so we only
need to re-check scan conditions on that one tuple version.

We don't want to force a recheck on every index fetch because that would
slow everything down. So we would like a simple and efficient way, ideally
O(1), of knowing about the existence of a WARM tuple in a chain, and do a
recheck in only those cases. A HOT chain that contains a WARM tuple is
referred to below as a "WARM chain", implying that it needs a recheck.

We could solve this by either 1) marking both old and new index tuples as
"recheck-required" or 2) marking the HOT chain on the heap as
"recheck-required" such that any tuple returned from the chain requires a
recheck.

The advantage of doing 1) is that the recheck is required only for the
compromised index.
Note that in the presence of multiple index keys pointing to the same root
line pointer, the chain needs a recheck for both the old as well as the new
index key. The old index key must not fetch the new tuple(s) and the new
index key must not fetch the old tuple(s). So irrespective of whether an
old or a new tuple is being returned, the caller must know about the WARM
tuples in the chain. Marking the index would therefore require us to mark
both old and new index tuples as requiring recheck. That requires an
additional index scan to locate the old row and then an additional write to
force it to recheck, which is algorithmically O(N log N) on the table size.

Marking the heap, option (2), is O(1) per updated index, so it seems
better. But since we don't know with respect to which index or indexes the
chain is broken, all index scans must do a recheck for a tuple returned
from a WARM chain.

How do we mark a chain as WARM? How do we clear the flag? There are a few
possible approaches:

1. Mark the first WARM tuple, which causes the HOT chain to become WARM,
with a special HEAP_RECHECK_REQUIRED flag (this flag must then be carried
over to any further additions to the chain). Then whenever a tuple is
returned from a chain, we must look forward into the chain for a WARM tuple
and let the caller know about the potential breakage. This seems simple but
would require us to follow the HOT chain every time to check whether it's
HOT or WARM.

2. Mark the root line pointer (or the root tuple) with a special
HEAP_RECHECK_REQUIRED flag to tell us about the presence of a WARM tuple in
the chain. Since all indexes point to the root line pointer, it should be
enough to just mark the root line pointer (or tuple) with this flag.

3. We could also adopt a hybrid approach and mark both the new index tuple
and the new heap tuple with the HEAP_RECHECK_REQUIRED flag, where the
presence of the flag on either forces a recheck.

Approach 2 seems more reasonable and simple.

There are only 2 bits for lp_flags and all four combinations are already
used. But for an LP_REDIRECT root line pointer, we could store this special
flag in the lp_len field, which is unused for LP_REDIRECT line pointers. So
we are able to mark the root line pointer.
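
As a rough illustration, here is a self-contained sketch of how the flag
could be encoded; the struct below only mimics the real ItemIdData layout,
and the LP_WARM_RECHECK bit is a made-up name, not existing PostgreSQL
code:

#include <assert.h>
#include <stdbool.h>

/* Mimics PostgreSQL's ItemIdData: 15-bit offset, 2-bit flags, 15-bit length. */
typedef struct ItemIdSketch
{
    unsigned lp_off:15,
             lp_flags:2,
             lp_len:15;
} ItemIdSketch;

#define LP_REDIRECT 2           /* HOT redirect; lp_len is normally 0 */

/* Hypothetical: borrow one bit of lp_len, unused for LP_REDIRECT pointers. */
#define LP_WARM_RECHECK 0x0001

static void
mark_redirect_warm(ItemIdSketch *lp)
{
    assert(lp->lp_flags == LP_REDIRECT);    /* only valid on redirect pointers */
    lp->lp_len |= LP_WARM_RECHECK;
}

static bool
redirect_needs_recheck(const ItemIdSketch *lp)
{
    return lp->lp_flags == LP_REDIRECT && (lp->lp_len & LP_WARM_RECHECK) != 0;
}

int
main(void)
{
    ItemIdSketch lp = { .lp_off = 2, .lp_flags = LP_REDIRECT, .lp_len = 0 };

    mark_redirect_warm(&lp);
    assert(redirect_needs_recheck(&lp));
    return 0;
}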

Let me explain the idea in more detail using an example and approach (2).

CREATE TABLE test (col1 int, col2 int, col3 int, col4 text);
CREATE INDEX testindx_col1 ON test (col1);
CREATE INDEX testindx_col2 ON test (col2);
CREATE INDEX testindx_col3 ON test (col3);

INSERT INTO test VALUES (1, 11, 111, 'foo');

We now have a single heap tuple and three indexes on three different
columns. Let's assume the heap tuple has CTID (0,1) at this point. Each
index then has exactly one index entry pointing to CTID (0,1):

testindx_col1: (1) ===> (0,1)
testindx_col2: (11) ===> (0,1)
testindx_col3: (111) ===> (0,1)

Now suppose a transaction T1 updates the table: "UPDATE test SET col1 = 2".
This does not satisfy the HOT property, since index column 'col1' is being
updated. But as per the WARM algorithm, since only testindx_col1's index
key has changed, we insert a new index tuple only in testindx_col1. Say the
new heap tuple has CTID (0,2). So testindx_col1 will have a new index tuple
with key (2), but still pointing to CTID (0,1):

testindx_col1: (1)   ===> (0,1)
               (2)   ===> (0,1)
testindx_col2: (11)  ===> (0,1)
testindx_col3: (111) ===> (0,1)

The heap at this point has two tuples, linked by a CTID chain:
(0,1)* ---> (0,2)

Both these tuples are reachable via all indexes, starting at the root
(0,1). A transaction that started before T1 should see (0,1) and one that
started after T1 should see (0,2); MVCC visibility will ensure that the
correct tuple is returned. But if a new transaction T2 looks up (col1 = 1),
MVCC will return (0,2), which does not satisfy the index condition.
Similarly, for an old transaction T0 looking up (col1 = 2), MVCC will
return (0,1), which does not satisfy the index condition either. IndexScan
must ensure that tuples are rechecked in such instances.
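
Here is a minimal sketch of that recheck step, with hypothetical stand-in
types (this is not the actual executor code; in reality the check would sit
around heap_hot_search() and the index scan key machinery):

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the visible heap tuple and the scan key. */
typedef struct TupleSketch
{
    int col1;
} TupleSketch;

typedef struct ScanKeySketch
{
    int col1_equals;
} ScanKeySketch;

/*
 * Sketch: after MVCC has picked the visible tuple from the chain rooted at
 * the CTID the index entry points to, re-evaluate the index condition if
 * the chain is marked as a WARM chain.  Only that one visible tuple needs
 * to be rechecked.
 */
static bool
index_scan_accepts(const TupleSketch *visible_tuple,
                   const ScanKeySketch *key,
                   bool chain_is_warm)
{
    if (visible_tuple == NULL)
        return false;           /* nothing visible in the chain */

    if (!chain_is_warm)
        return true;            /* plain HOT chain: the index key can be trusted */

    /* WARM chain: the index key may not match this tuple version. */
    return visible_tuple->col1 == key->col1_equals;
}

int
main(void)
{
    /* T2 looks up col1 = 1, but the visible version has col1 = 2. */
    TupleSketch   visible = { .col1 = 2 };
    ScanKeySketch key = { .col1_equals = 1 };

    return index_scan_accepts(&visible, &key, true) ? 1 : 0;   /* tuple is rejected */
}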

If T3 later updates col2 and T4 further updates col3,
T3: UPDATE test SET col2 = 12;
T4: UPDATE test SET col3 = 112;

The heap will look like:
(0,1)* ---> (0,2) ---> (0,3) ---> (0,4)

And the index pointers will be:

testindx_col1: (1)   ===> (0,1)
               (2)   ===> (0,1)
testindx_col2: (11)  ===> (0,1)
               (12)  ===> (0,1)
testindx_col3: (111) ===> (0,1)
               (112) ===> (0,1)

The (*) here indicates that there may exist a WARM tuple in the chain and
an index recheck is required. Our algorithm to recheck a heap tuple when
it's returned from a WARM chain should ensure that only valid and matching
tuples are returned by the index scans.

Marking index tuples DEAD
=====================
When a WARM chain becomes DEAD and is replaced by just an LP_DEAD stub, the
referencing index tuples can be marked as killed and removed by VACUUM.

HOT Pruning
==========
No change is required for HOT pruning other than to copy the
HEAP_RECHECK_REQUIRED flag to the line pointer when the root heap tuple is
pruned and the root line pointer is turned into a redirect line pointer.
Apart from that, HOT prune can do its job as today, i.e. remove dead tuples
from HOT or WARM chains and, if necessary, turn root line pointers into
redirect or dead line pointers.
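
For example, here is a simplified, self-contained sketch of carrying the
flag over during pruning (again using a made-up line pointer struct and
flag names, not actual PostgreSQL definitions):

#include <stdint.h>
#include <stdio.h>

/* Simplified line pointer, mimicking ItemIdData. */
typedef struct ItemIdSketch
{
    unsigned lp_off:15, lp_flags:2, lp_len:15;
} ItemIdSketch;

#define LP_NORMAL        1
#define LP_REDIRECT      2
#define LP_WARM_RECHECK  0x0001      /* hypothetical bit borrowed from lp_len */

/* Hypothetical tuple-header flag proposed in this design. */
#define HEAP_RECHECK_REQUIRED 0x0001

typedef struct TupleHeaderSketch
{
    uint16_t infomask;
} TupleHeaderSketch;

/*
 * Sketch: when HOT prune removes the root heap tuple and turns the root
 * line pointer into a redirect, the recheck flag must not be lost -- it is
 * copied from the old root tuple's header into the redirect line pointer.
 */
static void
prune_root_to_redirect(ItemIdSketch *root_lp,
                       const TupleHeaderSketch *old_root_tuple,
                       unsigned new_target_offnum)
{
    root_lp->lp_flags = LP_REDIRECT;
    root_lp->lp_off = new_target_offnum;
    root_lp->lp_len = 0;

    if (old_root_tuple->infomask & HEAP_RECHECK_REQUIRED)
        root_lp->lp_len |= LP_WARM_RECHECK;
}

int
main(void)
{
    ItemIdSketch      root = { .lp_off = 0, .lp_flags = LP_NORMAL, .lp_len = 40 };
    TupleHeaderSketch old_root = { .infomask = HEAP_RECHECK_REQUIRED };

    prune_root_to_redirect(&root, &old_root, 2);
    printf("redirect needs recheck: %d\n", (root.lp_len & LP_WARM_RECHECK) != 0);
    return 0;
}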

Clearing the flags
==============
We can't clear the flag until the stale index pointers are gone, even if
the chain is all HOT again. In other words, the flag must remain set as
long as there exists a wrong index key pointing to the root tuple.

One idea is to do this as part of vacuum: collect the root line pointers of
WARM chains during the first-phase heap scan, check the indexes for all
such tuples (this may be combined with the index vacuum) and then clear the
heap flags during the second pass, unless new tuples have been added to the
WARM chain in the meantime. We can detect that by checking that all tuples
in the WARM chain still have XIDs older than the OldestXmin that VACUUM is
using.

We don't really need to count all references to the root of a WARM chain.
What we need to know is whether there exists an index which has more than
one pointer to the root tuple. If yes, the flag cannot be cleared. ISTM
that even two bits per htid will be enough to track this state. Other ideas
welcome.
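
A toy sketch of that bookkeeping, assuming some mapping from the root TIDs
collected in the first heap pass to a small saturating counter that is
reset before each index is scanned (all names here are hypothetical):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Sketch: VACUUM only needs to know, per WARM-chain root, whether ANY single
 * index holds more than one pointer to it.  A saturating 2-bit counter per
 * root TID is enough: 0 = unseen, 1 = seen once, 2 = seen more than once in
 * the index currently being scanned.
 */
#define MAX_ROOTS 16            /* toy capacity for the illustration */

typedef struct WarmRootState
{
    unsigned char count;        /* saturating 0..2 counter for the current index */
    bool          cannot_clear; /* some index had >1 pointer to this root */
} WarmRootState;

static void
begin_index_scan(WarmRootState *roots, int nroots)
{
    for (int i = 0; i < nroots; i++)
        roots[i].count = 0;     /* restart the per-index counter */
}

static void
saw_index_pointer(WarmRootState *roots, int root_id)
{
    if (roots[root_id].count < 2)
        roots[root_id].count++;
    if (roots[root_id].count > 1)
        roots[root_id].cannot_clear = true;
}

int
main(void)
{
    WarmRootState roots[MAX_ROOTS];
    memset(roots, 0, sizeof(roots));

    /* testindx_col1 has two entries pointing at root 0 (keys 1 and 2). */
    begin_index_scan(roots, MAX_ROOTS);
    saw_index_pointer(roots, 0);
    saw_index_pointer(roots, 0);

    /* testindx_col2/col3 each have one entry for root 0. */
    begin_index_scan(roots, MAX_ROOTS);
    saw_index_pointer(roots, 0);

    printf("root 0: %s\n", roots[0].cannot_clear
           ? "flag must stay set" : "flag can be cleared");
    return 0;
}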

WAL Logging
===========

It's important to ensure that the flag is set whenever it is necessary,
while having false positives is not a problem; we merely do a little
wasteful work if the flag is incorrectly set. Since the flag will be set
only during heap_update() and that operation is already WAL-logged, this
can be piggybacked on the heap_update WAL record. Similarly, when a root
tuple is pruned to a redirect line pointer, the operation is already
WAL-logged and we can piggyback setting the line pointer flag on that WAL
record.

Flag clearing need not be WAL-logged, unless we can piggyback it onto some
existing WAL logging.

Vacuum
========

Vacuum should continue to do what it does today, i.e. prune HOT chains,
collect LP_DEAD line pointers, remove the associated index entries and then
mark the line pointers as LP_UNUSED. It could be extended to do the
additional work of clearing the flags, to avoid rechecks when they are no
longer required, as discussed above.

Testing
=====

It will be straightforward to have a developer-level index_recheck GUC
parameter.

index_recheck = off (default) | on

For testing, compare results between two queries, one with index_recheck=on
and one with index_recheck=off. If this finds any difference we’ll know we
have a bug.

We can use this in testing the feature in development and also use it in
the field if we suspect problems.

Comments/suggestions?

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
