Re: [PoC] Improve dead tuple storage for lazy vacuum

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2022-11-25 08:00:16
Message-ID: CAFBsxsGB4xAMtGnUEny=CyhKrW2DwBOPDeUnjBRvJ0N2K9TGeg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Nov 24, 2022 at 9:54 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
wrote:
>
> So it seems that there are two candidates of rt_node structure: (1)
> all nodes except for node256 are variable-size nodes and use pointer
> tagging, and (2) node32 and node128 are variable-sized nodes and do
> not use pointer tagging (fanout is in part of only these two nodes).
> rt_node can be 5 bytes in both cases. But before going to this step, I
> started to verify the idea of variable-size nodes by using 6-bytes
> rt_node. We can adjust the node kinds and node classes later.

First, I'm glad you picked up the size class concept and expanded it. (I
have some comments about the internal APIs below.)

Let's leave the pointer tagging piece out until the main functionality is
committed. We have all the prerequisites in place, except for a benchmark
random enough to demonstrate a benefit. I'm still not quite satisfied with
how the shared memory coding looked, and that is the only sticky problem we
still have, IMO. The rest is "just work".

That said, (1) and (2) above are still relevant -- variable sizing any
given node is optional, and we can refine as needed.

> Overall, the idea of variable-sized nodes is good, smaller size
> without losing search performance.

Good.

> I'm going to check the load
> performance as well.

Part of that is this, which gets called a lot more now that node1 expands:

+    if (inner)
+        newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
+                                                     rt_node_kind_info[kind].inner_size);
+    else
+        newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
+                                                     rt_node_kind_info[kind].leaf_size);

Since the memset for expanding a size class is now handled separately, these
can use the non-zeroing versions. When compiling a call to
MemoryContextAllocZero, the compiler has no idea how big the size is, so it
assumes the worst and optimizes for large sizes. On x86-64, that means using
"rep stos", which calls microcode found in the CPU's ROM. This is slow for
small sizes. The "init" function should be always-inline, with const
parameters where possible. That way, the memset can compile to a single
instruction for the smallest node kind. (More on alloc/init below)
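
To be concrete, an untested sketch of the idea, reusing the names quoted
above and assuming the kind-info array is declared const, so the size is a
compile-time constant wherever "kind" is constant at the call site:

static rt_node *
rt_alloc_node(radix_tree *tree, uint8 kind, bool inner)
{
    rt_node    *newnode;

    /* non-zeroing allocation; the init function does the memset */
    if (inner)
        newnode = (rt_node *) MemoryContextAlloc(tree->inner_slabs[kind],
                                                 rt_node_kind_info[kind].inner_size);
    else
        newnode = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[kind],
                                                 rt_node_kind_info[kind].leaf_size);

    return newnode;
}

/* with a constant "kind", the memset length folds to a constant */
static pg_attribute_always_inline void
rt_init_node(rt_node *node, uint8 kind, bool inner)
{
    if (inner)
        memset(node, 0, rt_node_kind_info[kind].inner_size);
    else
        memset(node, 0, rt_node_kind_info[kind].leaf_size);

    node->kind = kind;
    /* the initial fanout would also be set here */
}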

Note, there is a wrinkle: As currently written, inner_node128 searches the
child pointers for NULL when inserting, so when expanding from the partial to
the full size class, the new node must be zeroed. (Worth fixing in the short
term. I thought of this while writing the proof-of-concept for size
classes, but didn't mention it.) Medium term, rather than special-casing
this, I actually want to rewrite the inner node128 to be more similar to
the leaf, with an "isset" array, but accessed and tested differently. I
guarantee it's *really* slow to load now (maybe somewhat true even for
leaves), but I'll leave the details for later. Regarding the node128 leaf, note
that it's slightly larger than a DSA size class, and we can trim it to fit:

node61:  6 + 256 + (2) + 16 +  61*8 =  768
node125: 6 + 256 + (2) + 16 + 125*8 = 1280
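
For reference, a possible layout for the trimmed leaf, just to show where
the numbers above come from (field names are mine, not from the patch):

typedef struct rt_node_leaf_125
{
    rt_node     base;           /* 6 bytes of common header */
    uint8       slot_idxs[256]; /* chunk -> slot index, 0xFF if unused */
    /* 2 bytes of alignment padding here */
    uint64      isset[2];       /* 16 bytes: bitmap of which slots hold a value */
    uint64      values[125];    /* 125 * 8 = 1000 bytes */
} rt_node_leaf_125;             /* total 1280 bytes, exactly a DSA size class */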

> I've attached the patches I used for the verification. I don't include
> patches for pointer tagging, DSA support, and vacuum integration since
> I'm investigating the issue on cfbot that Andres reported. Also, I've
> modified tests to improve the test coverage.

Sounds good. For v12, I think size classes have proven themselves, so v11's
0002/4/5 can be squashed. Plus, some additional comments:

+/* Return a new and initialized node */
+static rt_node *
+rt_alloc_init_node(radix_tree *tree, uint8 kind, uint8 shift, uint8 chunk,
+                   bool inner)
+{
+    rt_node    *newnode;
+
+    newnode = rt_alloc_node(tree, kind, inner);
+    rt_init_node(newnode, kind, shift, chunk, inner);
+
+    return newnode;
+}

I don't see the point of a function that just calls two functions.

+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count of 'node'.
+ */
+static rt_node *
+rt_grow_node(radix_tree *tree, rt_node *node, int new_kind)
+{
+    rt_node    *newnode;
+
+    newnode = rt_alloc_init_node(tree, new_kind, node->shift, node->chunk,
+                                 node->shift > 0);
+    newnode->count = node->count;
+
+    return newnode;
+}

This, in turn, just calls a function that does _almost_ everything, and
additionally must set one member. This function should really be alloc-node
+ init-node + copy-common, where copy-common is like in the prototype:
+    newnode->node_shift = oldnode->node_shift;
+    newnode->node_chunk = oldnode->node_chunk;
+    newnode->count = oldnode->count;

And init-node should really be just memset + set kind + set initial fanout.
It has no business touching "shift" and "chunk". The callers rt_new_root,
rt_set_extend, and rt_extend set some values of their own anyway, so let
them set those, too -- it might even improve readability.
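
In other words, something like this (untested sketch, reusing the member
names quoted above, and with the slimmed-down init-node signature suggested
here):

static inline void
rt_copy_node_common(rt_node *newnode, rt_node *oldnode)
{
    newnode->shift = oldnode->shift;
    newnode->chunk = oldnode->chunk;
    newnode->count = oldnode->count;
}

static rt_node *
rt_grow_node(radix_tree *tree, rt_node *node, int new_kind)
{
    rt_node    *newnode;

    newnode = rt_alloc_node(tree, new_kind, node->shift > 0);
    rt_init_node(newnode, new_kind, node->shift > 0);
    rt_copy_node_common(newnode, node);

    return newnode;
}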

-    if (n32->base.n.fanout == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
+    if (NODE_NEEDS_TO_GROW_CLASS(n32, RT_CLASS_32_PARTIAL))

This macro doesn't really improve readability -- it obscures what is being
tested, and the name implies the "else" branch means "node doesn't need to
grow class", which is false. If we want to simplify expressions in this
block, I think it'd be more effective to improve the lines that follow:

+    memcpy(new32, n32, rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size);
+    new32->base.n.fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;

While I'm thinking of it, these arrays should be const so the compiler can
avoid runtime lookups. And maybe we can have const variables old_size and
new_fanout to break out the array lookups here?
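
For instance (untested, reusing the names quoted above):

    const Size  old_size = rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size;
    const int   new_fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;

    memcpy(new32, n32, old_size);
    new32->base.n.fanout = new_fanout;

Speaking of compile-time constants...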

+/* Copy both chunks and children/values arrays */
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+                          uint8 *dst_chunks, rt_node **dst_children, int count)
+{
+    /* For better code generation */
+    if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
+        pg_unreachable();
+
+    memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+    memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+}

When I looked at this earlier, I somehow didn't go far enough -- why are we
passing the runtime count in the first place? This function can only be
called if count == rt_size_class_info[RT_CLASS_4_FULL].fanout. The last
parameter to memcpy should evaluate to a compile-time constant, right? Even
when we add node shrinking in the future, the constant should be correct,
IIUC?
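
That is, something like this sketch (assuming the class-info array is
declared const):

static inline void
chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
                          uint8 *dst_chunks, rt_node **dst_children)
{
    /* the only caller grows a full node4, so the count is a compile-time constant */
    const int   count = rt_size_class_info[RT_CLASS_4_FULL].fanout;

    memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
    memcpy(dst_children, src_children, sizeof(rt_node *) * count);
}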

-    .fanout = 256,
+    /*
+     * Technically it's 256, but we can't store that in a uint8, and this is
+     * the max size class so it will never grow.
+     */
+    .fanout = 0,

-    Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+    Assert(((rt_node *) n256)->fanout == 0);
+    Assert(chunk_exists || ((rt_node *) n256)->count < 256);

These hacks were my work, but I think we can improve on them by having two
versions of NODE_HAS_FREE_SLOT -- one for fixed- and one for variable-sized
nodes. For that to work, "init-node" would need a branch to set fanout to
zero for node256. That should be fine -- it already has to branch for
memset'ing node128's indexes to 0xFF.
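
For example (a sketch; this assumes the class-info array keeps the true
fanout of 256 for node256, while init-node zeroes the uint8 fanout field
stored in the node itself):

/* for variable-sized nodes, fanout means the allocated capacity */
#define VAR_NODE_HAS_FREE_SLOT(node) \
    ((node)->base.n.count < (node)->base.n.fanout)

/* for fixed-sized nodes, the capacity is a compile-time constant */
#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
    ((node)->base.n.count < rt_size_class_info[class].fanout)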

--
John Naylor
EDB: http://www.enterprisedb.com
