Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Andres Freund <andres(at)anarazel(dot)de>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2022-07-04 22:00:38
Message-ID: 20220704220038.at2ane5xkymzzssb@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2022-06-28 15:24:11 +0900, Masahiko Sawada wrote:
> In both test cases, there is not much difference between using AVX2
> and SSE2. The more node types, the more time it takes to load the
> data (see sse2_4_16_32_128_256).

Yea, at some point the compiler starts using a jump table instead of branches,
and that turns out to be a good bit more expensive. And even with branches, it
obviously adds hard-to-predict ones. IIRC I fought a bit with the compiler
to avoid some of that cost; it's possible that got "lost" in Sawada-san's
patch.

Sawada-san, what led you to discard the 1 and 16 node types? IIRC the
1-element node is not unimportant until we have path compression.

Right now the node struct sizes are:
node4   - 48 bytes
node32  - 296 bytes
node128 - 1304 bytes
node256 - 2088 bytes

I guess radix_tree_node_128->isset is just 16 bytes compared to 1288 other
bytes, but needing that separate isset array somehow is sad :/. I wonder if a
smaller "free index" would do the trick? Point to the element + 1 where we
searched last and start a plain loop there. Particularly in an insert-only
workload that'll always work, and in other cases it'll still often work I
think.
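A minimal sketch of that free-index idea (the struct layout and names here are
made up for illustration, not taken from the patch): keep a one-byte hint next
to the isset bitmap, start the free-slot scan there, and wrap around. For
insert-only workloads the scan terminates on the first probe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NODE128_CAPACITY 128

/* Hypothetical node128 layout: isset bitmap plus a "probably free" hint. */
typedef struct Node128
{
	uint8_t		isset[NODE128_CAPACITY / 8];
	uint8_t		free_hint;		/* index at which to start the free-slot scan */
} Node128;

static inline bool
slot_is_set(const Node128 *node, int i)
{
	return (node->isset[i / 8] >> (i % 8)) & 1;
}

/*
 * Find and claim a free slot, starting at the hint. In an insert-only
 * workload the first probe always succeeds; otherwise we fall back to a
 * plain wrap-around loop.
 */
static int
node128_alloc_slot(Node128 *node)
{
	for (int n = 0; n < NODE128_CAPACITY; n++)
	{
		int			i = (node->free_hint + n) % NODE128_CAPACITY;

		if (!slot_is_set(node, i))
		{
			node->isset[i / 8] |= 1 << (i % 8);
			node->free_hint = (i + 1) % NODE128_CAPACITY;
			return i;
		}
	}
	return -1;					/* node full */
}
```

After deletions the hint can point at an occupied slot, so this degrades to a
linear scan in the worst case, but it never does worse than scanning the whole
bitmap from index 0.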

One thing I was wondering about is trying to choose node types in
roughly-power-of-two struct sizes. It's pretty easy to end up with significant
fragmentation in the slabs right now when inserting as you go, because some of
the smaller node types will be freed, but not enough of them to actually free
blocks of memory. If we instead had ~power-of-two sizes, we could use a single
slab of the max size and carve the smaller node types out of that largest
allocation.
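To illustrate the sizing arithmetic (just a sketch, assuming the current
struct sizes from above): rounding 48/296/1304 up gives 64/512/2048 bytes,
which divide a max-size chunk evenly; node256 at 2088 bytes is just over 2048,
so it'd need to be trimmed slightly to stay within a power-of-two budget.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Round a node struct size up to the next power of two, so every node type
 * can be carved out of chunks of the single, largest-size slab.
 */
static size_t
pow2_size(size_t n)
{
	size_t		p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* How many nodes of a given size fit into one max-size slab chunk. */
static size_t
nodes_per_chunk(size_t node_size, size_t chunk_size)
{
	return chunk_size / pow2_size(node_size);
}
```

With 2048-byte chunks, that carves 32 node4s, 4 node32s, or 1 node128 per
chunk, so freed small nodes can be reused for any node type instead of
lingering in a per-type slab.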

Btw, that fragmentation is another reason why I think it's better to track
memory usage via memory contexts, rather than doing so based on
GetMemoryChunkSpace().

> > Ideally, node16 and node32 would have the same code with a different
> > loop count (1 or 2). More generally, there is too much duplication of
> > code (noted by Andres in his PoC), and there are many variable names
> > with the node size embedded. This is a bit tricky to make more
> > general, so we don't need to try it yet, but ideally we would have
> > something similar to:
> >
> > switch (node->kind) // todo: inspect tagged pointer
> > {
> >     case RADIX_TREE_NODE_KIND_4:
> >         idx = node_search_eq(node, chunk, 4);
> >         do_action(node, idx, 4, ...);
> >         break;
> >     case RADIX_TREE_NODE_KIND_32:
> >         idx = node_search_eq(node, chunk, 32);
> >         do_action(node, idx, 32, ...);
> >     ...
> > }

FWIW, that should be doable with an inline function, if you pass it the memory
to the "array" rather than the node directly. Not so sure it's a good idea to
do dispatch between node types / search methods inside the helper, as you
suggest below:

> > static pg_alwaysinline void
> > node_search_eq(radix_tree_node node, uint8 chunk, int16 node_fanout)
> > {
> >     if (node_fanout <= SIMPLE_LOOP_THRESHOLD)
> >         // do simple loop with (node_simple *) node;
> >     else if (node_fanout <= VECTORIZED_LOOP_THRESHOLD)
> >         // do vectorized loop where available with (node_vec *) node;
> >     ...
> > }
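The array-passing variant could look roughly like this (a sketch; the helper
name is made up): the caller hands over the chunk array and a constant fanout,
so when the call is inlined with a compile-time-constant count, the compiler
can unroll or vectorize the loop separately per node type, without any
per-node-type dispatch inside the helper.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical inline helper: linear search over a node's chunk array.
 * Each call site passes its node type's fixed fanout as 'count', letting
 * the compiler specialize the loop at every (inlined) call site.
 */
static inline int
chunk_array_search_eq(const uint8_t *chunks, int count, uint8_t chunk)
{
	for (int i = 0; i < count; i++)
	{
		if (chunks[i] == chunk)
			return i;
	}
	return -1;					/* not found */
}
```

A call site would then be e.g. `idx = chunk_array_search_eq(n4->chunks, 4,
chunk);` in the KIND_4 branch of the switch, keeping the dispatch in the
caller rather than the helper.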

Greetings,

Andres Freund
