Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Andres Freund <andres(at)anarazel(dot)de>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2022-07-04 21:18:22
Message-ID: 20220704211822.kfxtzpcdmslzm2dy@awork3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2022-06-16 13:56:55 +0900, Masahiko Sawada wrote:
> diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
> new file mode 100644
> index 0000000000..bf87f932fd
> --- /dev/null
> +++ b/src/backend/lib/radixtree.c
> @@ -0,0 +1,1763 @@
> +/*-------------------------------------------------------------------------
> + *
> + * radixtree.c
> + * Implementation for adaptive radix tree.
> + *
> + * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
> + * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
> + * Neumann, 2013.
> + *
> + * There are some differences from the proposed implementation. For instance,
> + * this radix tree module utilizes AVX2 instruction, enabling us to use 256-bit
> + * width SIMD vector, whereas 128-bit width SIMD vector is used in the paper.
> + * Also, there is no support for path compression and lazy path expansion. The
> + * radix tree supports fixed length of the key so we don't expect the tree level
> + * wouldn't be high.

I think we're going to need path compression at some point, fwiw. I'd bet on
it being beneficial even for the tid case.

> + * The key is a 64-bit unsigned integer and the value is a Datum.

I don't think it's a good idea to define the value type to be a datum.

> +/*
> + * As we descend a radix tree, we push the node to the stack. The stack is used
> + * at deletion.
> + */
> +typedef struct radix_tree_stack_data
> +{
> + radix_tree_node *node;
> + struct radix_tree_stack_data *parent;
> +} radix_tree_stack_data;
> +typedef radix_tree_stack_data *radix_tree_stack;

I think it's a very bad idea for traversal to need allocations. I really want
to eventually use this for shared structures (eventually with lock-free
searches at least), and needing to do allocations while traversing the tree is
a no-go for that.

Particularly given that the tree currently has a fixed depth, can't you just
allocate this on the stack once?

> +/*
> + * Allocate a new node with the given node kind.
> + */
> +static radix_tree_node *
> +radix_tree_alloc_node(radix_tree *tree, radix_tree_node_kind kind)
> +{
> + radix_tree_node *newnode;
> +
> + newnode = (radix_tree_node *) MemoryContextAllocZero(tree->slabs[kind],
> + radix_tree_node_info[kind].size);
> + newnode->kind = kind;
> +
> + /* update the statistics */
> + tree->mem_used += GetMemoryChunkSpace(newnode);
> + tree->cnt[kind]++;
> +
> + return newnode;
> +}

Why are you tracking the memory usage at this level of detail? It's *much*
cheaper to track memory usage via the memory contexts? Since they're dedicated
for the radix tree, that ought to be sufficient?

> + else if (idx != n4->n.count)
> + {
> + /*
> + * the key needs to be inserted in the middle of the
> + * array, make space for the new key.
> + */
> + memmove(&(n4->chunks[idx + 1]), &(n4->chunks[idx]),
> + sizeof(uint8) * (n4->n.count - idx));
> + memmove(&(n4->slots[idx + 1]), &(n4->slots[idx]),
> + sizeof(radix_tree_node *) * (n4->n.count - idx));
> + }

Maybe we could add a static inline helper for these memmoves? Both because
it's repetitive (for different node types) and because the last time I looked
gcc was generating quite bad code for this. And having to put workarounds into
multiple places is obviously worse than having to do it in one place.

> +/*
> + * Insert the key with the val.
> + *
> + * found_p is set to true if the key already present, otherwise false, if
> + * it's not NULL.
> + *
> + * XXX: do we need to support update_if_exists behavior?
> + */

Yes, I think that's needed - hence using bfm_set() instead of insert() in the
prototype.

> +void
> +radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p)
> +{
> + int shift;
> + bool replaced;
> + radix_tree_node *node;
> + radix_tree_node *parent = tree->root;
> +
> + /* Empty tree, create the root */
> + if (!tree->root)
> + radix_tree_new_root(tree, key, val);
> +
> + /* Extend the tree if necessary */
> + if (key > tree->max_val)
> + radix_tree_extend(tree, key);

FWIW, the reason I used separate functions for these in the prototype is that
it turns out to generate a lot better code, because it allows non-inlined
function calls to be sibling calls - thereby avoiding the need for a dedicated
stack frame. That's not possible once you need a palloc or such, so splitting
off those call paths into dedicated functions is useful.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Justin Pryzby 2022-07-04 21:22:28 Re: doc: BRIN indexes and autosummarize
Previous Message Andres Freund 2022-07-04 20:55:55 Re: [PoC] Improve dead tuple storage for lazy vacuum