Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>
Subject: Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Date: 2019-03-04 06:02:13
Message-ID: ebc1cecb-3246-7a0b-622d-7709a081e687@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Some comments on
v13-0002-make-heap-TID-a-tie-breaker-nbtree-index-column.patch below.
Mostly about code comments. In general, I think a round of copy-editing
the comments, to use simpler language, would do good. The actual code
changes look good to me.

> /*
> * _bt_findinsertloc() -- Finds an insert location for a tuple
> *
> * On entry, *bufptr contains the page that the new tuple unambiguously
> * belongs on. This may not be quite right for callers that just called
> * _bt_check_unique(), though, since they won't have initially searched
> * using a scantid. They'll have to insert into a page somewhere to the
> * right in rare cases where there are many physical duplicates in a
> * unique index, and their scantid directs us to some page full of
> * duplicates to the right, where the new tuple must go. (Actually,
> * since !heapkeyspace pg_upgraded'd non-unique indexes never get a
> * scantid, they too may require that we move right. We treat them
> * somewhat like unique indexes.)

Seems confusing to first say assertively that "*bufptr contains the page
that the new tuple unambiguously belongs to", and then immediately go on
to list a whole bunch of exceptions. Maybe just remove "unambiguously".

> @@ -759,7 +787,10 @@ _bt_findinsertloc(Relation rel,
> * If this page was incompletely split, finish the split now. We
> * do this while holding a lock on the left sibling, which is not
> * good because finishing the split could be a fairly lengthy
> - * operation. But this should happen very seldom.
> + * operation. But this should only happen when inserting into a
> + * unique index that has more than an entire page for duplicates
> + * of the value being inserted. (!heapkeyspace non-unique indexes
> + * are an exception, once again.)
> */
> if (P_INCOMPLETE_SPLIT(lpageop))
> {

This happens very seldom, because you only get an incomplete split if
you crash in the middle of a page split, and that should be very rare. I
don't think we need to list more fine-grained conditions here, that just
confuses the reader.

> /*
> * _bt_useduplicatepage() -- Settle for this page of duplicates?
> *
> * Prior to PostgreSQL 12/Btree version 4, heap TID was never treated
> * as a part of the keyspace. If there were many tuples of the same
> * value spanning more than one leaf page, a new tuple of that same
> * value could legally be placed on any one of the pages.
> *
> * This function handles the question of whether or not an insertion
> * of a duplicate into a pg_upgrade'd !heapkeyspace index should
> * insert on the page contained in buf when a choice must be made.
> * Preemptive microvacuuming is performed here when that could allow
> * caller to insert on to the page in buf.
> *
> * Returns true if caller should proceed with insert on buf's page.
> * Otherwise, caller should move on to the page to the right (caller
> * must always be able to still move right following call here).
> */

So, this function is only used for legacy pg_upgraded indexes. The
comment implies that, but doesn't actually say it.

> /*
> * Get tie-breaker heap TID attribute, if any. Macro works with both pivot
> * and non-pivot tuples, despite differences in how heap TID is represented.
> *
> * Assumes that any tuple without INDEX_ALT_TID_MASK set has a t_tid that
> * points to the heap, and that all pivot tuples have INDEX_ALT_TID_MASK set
> * (since all pivot tuples must as of BTREE_VERSION 4). When non-pivot
> * tuples use the INDEX_ALT_TID_MASK representation in the future, they'll
> * probably also contain a heap TID at the end of the tuple. We currently
> * assume that a tuple with INDEX_ALT_TID_MASK set is a pivot tuple within
> * heapkeyspace indexes (and that a tuple without it set must be a non-pivot
> * tuple), but it might also be used by non-pivot tuples in the future.
> * pg_upgrade'd !heapkeyspace indexes only set INDEX_ALT_TID_MASK in pivot
> * tuples that actually originated with the truncation of one or more
> * attributes.
> */
> #define BTreeTupleGetHeapTID(itup) ...

The comment claims that "all pivot tuples must be as of BTREE_VERSION
4". I thought that all internal tuples are called pivot tuples, even on
version 3. I think what this means to say is that this macro is only
used on BTREE_VERSION 4 indexes. Or perhaps that pivot tuples can only
have a heap TID in BTREE_VERSION 4 indexes.

This macro (and many others in nbtree.h) is quite complicated. A static
inline function might be easier to read.

> @@ -1114,6 +1151,8 @@ _bt_insertonpg(Relation rel,
>
> if (BufferIsValid(metabuf))
> {
> + Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
> + xlmeta.version = metad->btm_root;
> xlmeta.root = metad->btm_root;
> xlmeta.level = metad->btm_level;
> xlmeta.fastroot = metad->btm_fastroot;

'xlmeta.version' is set incorrectly.

> /*
> * Btree version 4 (used by indexes initialized by PostgreSQL v12) made
> * general changes to the on-disk representation to add support for
> * heapkeyspace semantics, necessitating a REINDEX to get heapkeyspace
> * semantics in pg_upgrade scenarios. We continue to offer support for
> * BTREE_MIN_VERSION in order to support upgrades from PostgreSQL versions
> * up to and including v10 to v12+ without requiring a REINDEX.
> * Similarly, we continue to offer support for BTREE_NOVAC_VERSION to
> * support upgrades from v11 to v12+ without requiring a REINDEX.
> *
> * We maintain PostgreSQL v11's ability to upgrade from BTREE_MIN_VERSION
> * to BTREE_NOVAC_VERSION automatically. v11's "no vacuuming" enhancement
> * (the ability to skip full index scans during vacuuming) only requires
> * two new metapage fields, which makes it possible to upgrade at any
> * point that the metapage must be updated anyway (e.g. during a root page
> * split). Note also that there happened to be no changes in metapage
> * layout for btree version 4. All current metapage fields should have
> * valid values set when a metapage WAL record is replayed.
> *
> * It's convenient to consider the "no vacuuming" enhancement (metapage
> * layout compatibility) separately from heapkeyspace semantics, since
> * each issue affects different areas. This is just a convention; in
> * practice a heapkeyspace index is always also a "no vacuuming" index.
> */
> #define BTREE_METAPAGE 0 /* first page is meta */
> #define BTREE_MAGIC 0x053162 /* magic number of btree pages */
> #define BTREE_VERSION 4 /* current version number */
> #define BTREE_MIN_VERSION 2 /* minimal supported version number */
> #define BTREE_NOVAC_VERSION 3 /* minimal version with all meta fields */

I find this comment difficult to read. I suggest rewriting it to:

/*
* The current Btree version is 4. That's what you'll get when you create
* a new index.
*
* Btree version 3 was used in PostgreSQL v11. It is mostly the same as
* version 4, but heap TIDs were not part of the keyspace. Index tuples
* with duplicate keys could be stored in any order. We continue to
* support reading and writing Btree version 3, so that they don't need
* to be immediately re-indexed at pg_upgrade. In order to get the new
* heapkeyspace semantics, however, a REINDEX is needed.
*
* Btree version 2 is the same as version 3, except for two new fields
* in the metapage that were introduced in version 3. A version 2 metapage
* will be automatically upgraded to version 3 on the first insert to it.
*/

Now that the index tuple format becomes more complicated, I feel that
there should be some kind of an overview explaining the format. All the
information is there, in the comments in nbtree.h, but you have to piece
together all the details to get the overall picture. I wrote this to
keep my head straight:

B-tree tuple format
===================

Leaf tuples
-----------

t_tid | t_info | key values | INCLUDE columns if any

t_tid points to the heap TID.

Pivot tuples
------------

All tuples on internal pages are pivot tuples. Also, the high keys on
leaf pages.

t_tid | t_info | key values | [heap TID]

The INDEX_ALT_TID_MASK bit in t_info is set.

The block number in 't_tid' points to the lower B-tree page.

The lower bits in 't_tid.ip_posid' store the number of keys stored (it
can be less than the number of keys in the index, if some keys have been
suffix-truncated). If BT_HEAP_TID_ATTR flag is set, there's an extra
heap TID field after the key datums.

(In version 3 indexes, the INDEX_ALT_TID_MASK flag might not be set. In
that case, the number keys is implicitly the same as the number of keys
in the index, and there is no heap TID.)

I think adding something like this in nbtree.h would be good.

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fabien COELHO 2019-03-04 06:05:39 Re: Online verification of checksums
Previous Message Masahiko Sawada 2019-03-04 05:58:05 Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)