Re: "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)
Date: 2018-08-30 16:15:56
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers

On Wed, Aug 29, 2018 at 11:28 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> If you include heap TID as a column the suffix will be unique and cannot
> benefit from suffix truncation.

Right. During a page split, we must generate a new high key that's
less than or equal to all items on the left side (where the new high
key goes), and strictly greater than all items on the right side
(where the new high key values will be used as a downlink to). We'll
only include heap TID explicitly in the new high key when there is no
other way to get a value that is strictly less than all tuples on the
right side, and only at the leaf level (we don't truncate during
internal page splits, and the rules automatically work at non-leaf
levels because values come from the leaf level).

The original L&Y invariant that I propose to restore requires that
downlinks be strictly less than all items on the child page the point
to. Truncated attributes logically retain the value minus infinity,
which is often what makes the true L&Y invariant hold.

> It would be better to avoid including the heap TID as a column, since it is
> already there. Or the other way around, don't include it as payload if it is
> has to be a column.

The high level principle to bear in mind is that pivot tuples (high
keys at the leaf level, and all internal page tuples) only need to
correctly guide searches. There is no obligation for pivot tuples to
be able to retrieve the original values, for something like an
index-only scan. We could even make their representation into a flat
binary string someday -- this is called "key normalization" [1].
Likewise, pivot tuples are good candidates for automatic prefix
compression because there is scarcely any need to decompress them (not
working on that for v12, though).

Nothing changes about the direct representation of 99%+ of tuples
(those at the leaf level that point to the heap -- non-pivot tuples).
We need an alternative representation for heap TID in pivot tuples for
historical reasons (actually, I think what we have here would even
work best in a green field situation).

Not including a special heap TID representation in pivots, but
including everything else is what I've called "logical truncation" on
this thread. If you compare a serial primary key case with my patch
and without, you typically see no difference in the size of the index.
It is possible to end up with a larger index overall, but that should
be extremely rare, and only affect very small indexes.

> Alternatively, include only the heap block number. That would make it
> non-unique, but never more than 240 duplicates. So it would allow suffix
> truncation, and yet still avoid the multi-page split effect.

I'm not sure what you mean by the multi-page split effect -- all
internal pages are split recursively, during a leaf page split. In
general, free space management needs to remain as simple as possible,
and splitting the same page twice in one go is a big no-no.

I conservatively enforce that the 1/3 of a page restriction has a bit
of extra slop for an extra heap TID. There is one choke point that
everything takes place -- during leaf page splits. Nothing else
changes, apart from index scan logic, which may have to search with an
extra heap TID column, plus one or two other small things along those

Peter Geoghegan

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Dagfinn Ilmari =?utf-8?Q?Manns=C3=A5ker?= 2018-08-30 16:44:02 [PATCH] Tab completion for ALTER DATABASE … SET TABLESPACE
Previous Message Gunnlaugur Thor Briem 2018-08-30 16:03:11 Re: pg_upgrade fails saying function unaccent(text) doesn't exist