Re: Patch: Write Amplification Reduction Method (WARM)

From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Jaime Casanova <jaime(dot)casanova(at)2ndquadrant(dot)com>, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Patch: Write Amplification Reduction Method (WARM)
Date: 2017-02-21 12:12:45
Message-ID: CABOikdMWMS71HaN4RtRuUehZVGJ8_z_VL6GpkmbNSMfBTyFb+Q@mail.gmail.com
Lists: pgsql-hackers
On Thu, Feb 2, 2017 at 6:17 PM, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
wrote:

>
> Please see rebased patches attached. There is not much change other than
> the fact the patch now uses new catalog maintenance API.
>
>
Another rebase on current master.

This time I am also attaching a proof-of-concept patch to demonstrate chain
conversion. The proposed algorithm is mentioned in the README.WARM, but
I'll briefly explain here.

The chain conversion works in two phases and requires another index pass
during vacuum. During the first heap scan, we collect candidate chains for
conversion. A chain qualifies for conversion if all of its tuples have
matching index keys with respect to all current indexes (i.e. the chain
becomes HOT). WARM chains become HOT as and when old versions retire (or
new versions retire in case of aborts). But before we can mark them HOT
again, we must first remove duplicate (and potentially wrong) index
pointers. This algorithm deals with that.

When a WARM update occurs and we insert a new index entry in one or more
indexes, we mark the new index pointer with a special RED flag. The heap
tuple created by this UPDATE is also marked RED. If the tuple is then
HOT-updated, subsequent versions will be marked RED as well. IOW each WARM
chain has two HOT chains inside it, and these are identified as the BLUE
and RED chains. The index pointer which satisfies the key in the RED chain
is marked RED too.

When we collect candidate WARM chains in the first heap scan, we also
remember the color of the chain.

During the first index scan we delete all known dead index pointers (same
as lazy_tid_reaped). We also count the number of RED and BLUE pointers to
each candidate chain.

The next index scan will either 1. remove an index pointer which is known
to be useless or 2. color a RED pointer BLUE.
- A BLUE pointer to a RED chain is removed when there exists a RED pointer
to the chain. If there is no RED pointer, we can't remove the BLUE pointer
because that is the only path to the heap tuple (the case when WARM does
not cause a new index entry). Instead, we color the heap tuples BLUE.
- A BLUE pointer to a BLUE chain is always retained.
- A RED pointer to a BLUE chain is always removed (aborted updates).
- A RED pointer to a RED chain is colored BLUE (we will color the heap
tuples BLUE in the second heap scan).
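
The four rules above can be written as a small decision function. This is
only a sketch for discussion and is not code from the attached patches:
the names WarmColor, PointerAction, CandidateChain and second_pass_action
are hypothetical, and it assumes the RED/BLUE pointer counts were gathered
per index during the first index scan.

```c
#include <assert.h>

/* Hypothetical names; none of these come from the attached patches. */
typedef enum { COLOR_BLUE, COLOR_RED } WarmColor;

typedef enum {
    PTR_KEEP,               /* leave the index pointer untouched */
    PTR_REMOVE,             /* delete the index pointer */
    PTR_COLOR_BLUE,         /* flip a RED index pointer to BLUE */
    PTR_KEEP_RECOLOR_HEAP   /* keep the pointer, color heap tuples BLUE */
} PointerAction;

/* Per-chain state assumed to be collected by the earlier scans. */
typedef struct {
    WarmColor chain_color;      /* remembered in the first heap scan */
    int       num_red_pointers; /* counted in the first index scan */
    int       num_blue_pointers;
} CandidateChain;

/* Decide what the second index scan does with one pointer to a chain. */
static PointerAction
second_pass_action(WarmColor pointer_color, const CandidateChain *chain)
{
    assert(chain->num_red_pointers >= 0 && chain->num_blue_pointers >= 0);

    if (chain->chain_color == COLOR_BLUE)
    {
        /* BLUE->BLUE is retained; RED->BLUE comes from an aborted update. */
        return (pointer_color == COLOR_BLUE) ? PTR_KEEP : PTR_REMOVE;
    }

    /* RED chain from here on. */
    if (pointer_color == COLOR_RED)
        return PTR_COLOR_BLUE;

    /* BLUE pointer to a RED chain: removable only if a RED pointer exists. */
    return (chain->num_red_pointers > 0) ? PTR_REMOVE : PTR_KEEP_RECOLOR_HEAP;
}
```

Keeping the decision in one pure function like this would make the four
cases easy to test in isolation from the index access method.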

Once the index pointers are taken care of such that there exists exactly
one pointer to a chain, the chain can be converted into HOT chains by
clearing WARM and RED flags.

There is one problematic case: aborted vacuums. If a crash happens after
coloring a RED pointer BLUE, but before we can clear the flags on the heap
tuples, we might end up with two BLUE pointers to a RED chain. This case
will require recheck logic and is not yet implemented.

The POC only works with BTREEs because the unused bit in IndexTuple's
t_info is already used by HASH indexes. For heap tuples, we can reuse one
of the HEAP_MOVED_IN/OFF bits for marking tuples RED since this is only
required for WARM tuples. So the bit can be checked along with the WARM bit.

Unless there is an objection to the design or someone thinks it cannot
work, I'll look at some alternate mechanism to free up more bits in the
tuple header or at least in the index tuples. One idea is to free up 3 bits
from ip_posid, knowing that OffsetNumber can never really need more than 13
bits with the other constraints in place. We could use some bit-field magic
to do that with minimal changes. The thing that concerns me is whether
there will be a guaranteed way to make that work on all hardware without
breaking the on-disk layout.
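
To make the ip_posid idea concrete, here is a minimal sketch of packing a
13-bit offset and 3 flag bits into one 16-bit value. It is not from the
patches: the POSID_* macros and posid_* functions are hypothetical names,
and it deliberately uses explicit shifts and masks instead of C bit-fields,
since bit-field layout is implementation-defined and that is the
portability concern mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: low 13 bits hold the offset, high 3 bits are flags. */
#define POSID_OFFSET_BITS   13
#define POSID_OFFSET_MASK   ((uint16_t) ((1u << POSID_OFFSET_BITS) - 1)) /* 0x1FFF */
#define POSID_FLAGS_SHIFT   POSID_OFFSET_BITS
#define POSID_FLAGS_MASK    ((uint16_t) 0x7)

/* Combine an offset and up to 3 flag bits into one 16-bit value. */
static uint16_t
posid_pack(uint16_t offset, uint16_t flags)
{
    assert(offset <= POSID_OFFSET_MASK);
    assert(flags <= POSID_FLAGS_MASK);
    return (uint16_t) (offset | (flags << POSID_FLAGS_SHIFT));
}

/* Extract the offset, ignoring the flag bits. */
static uint16_t
posid_offset(uint16_t raw)
{
    return (uint16_t) (raw & POSID_OFFSET_MASK);
}

/* Extract the 3 flag bits. */
static uint16_t
posid_flags(uint16_t raw)
{
    return (uint16_t) ((raw >> POSID_FLAGS_SHIFT) & POSID_FLAGS_MASK);
}
```

With this layout, a value whose flag bits are all zero reads back as the
same number it holds today, which is one possible way to stay compatible
with existing on-disk data.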

Comments/suggestions?

Thanks,
Pavan

-- 
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment: 0003_convert_chains_v12.patch
Description: application/octet-stream (69.9 KB)
Attachment: 0002_warm_updates_v12.patch
Description: application/octet-stream (120.6 KB)
Attachment: 0001_track_root_lp_v12.patch
Description: application/octet-stream (38.4 KB)
Attachment: 0000_interesting_attrs.patch
Description: application/octet-stream (11.6 KB)
