Re: [PoC] Improve dead tuple storage for lazy vacuum

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2022-06-27 11:12:13
Message-ID: CAFBsxsGOOuf1Sfi=ByO4WBUhqSUWj+P7xWZpmrMK3BA19pOuUg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jun 20, 2022 at 7:57 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:

[v3 patch]

Hi Masahiko,

Since there are new files, and they are pretty large, I've attached
most specific review comments and questions as a diff rather than in
the email body. This is not a full review, which will take more time
-- this is a first pass mostly to aid my understanding, and discuss
some of the design and performance implications.

I tend to think it's a good idea to avoid most cosmetic review until
it's close to commit, but I did mention a couple things that might
enhance readability during review.

As I mentioned to you off-list, I have some thoughts on the nodes using SIMD:

> On Thu, Jun 16, 2022 at 4:30 PM John Naylor
> <john(dot)naylor(at)enterprisedb(dot)com> wrote:
> >
> > For now, though, I'd like to question
> > why we even need to use 32-byte registers in the first place. For one,
> > the paper referenced has 16-pointer nodes, but none for 32 (next level
> > is 48 and uses a different method to find the index of the next
> > pointer). Andres' prototype has 32-pointer nodes, but in a quick read
> > of his patch a couple weeks ago I don't recall a reason mentioned for
> > it.
>
> I might be wrong but since AVX2 instruction set is introduced in
> Haswell microarchitecture in 2013 and the referenced paper is
> published in the same year, the art didn't use AVX2 instruction set.

Sure, but with a bit of work the same technique could be done on that
node size with two 16-byte registers.

> 32-pointer nodes are better from a memory perspective as you
> mentioned. Andres' prototype supports both 16-pointer nodes and
> 32-pointer nodes (out of 6 node types). This would provide better
> memory usage but on the other hand, it would also bring overhead of
> switching the node type.

Right, using more node types provides smaller increments of node size.
Just changing node type can be better or worse, depending on the
input.

> Anyway, it's an important design decision to
> support which size of node to support. It should be done based on
> experiment results and documented.

Agreed. I would add that in the first step, we want something
straightforward to read and easy to integrate into our codebase. I
suspect other optimizations would be worth a lot more than using AVX2:
- collapsing inner nodes
- taking care when constructing the key (more on this when we
integrate with VACUUM)
...and a couple Andres mentioned:
- memory management: in
https://www.postgresql.org/message-id/flat/20210717194333.mr5io3zup3kxahfm%40alap3.anarazel.de
- node dispatch:
https://www.postgresql.org/message-id/20210728184139.qhvx6nbwdcvo63m6%40alap3.anarazel.de

Therefore, I would suggest that we use SSE2 only, because:
- portability is very easy
- to avoid a performance hit from indirecting through a function pointer

When the PG16 cycle opens, I will work separately on ensuring the
portability of using SSE2, so you can focus on other aspects. I think
it would be a good idea to have both node16 and node32 for testing.
During benchmarking we can delete one or the other and play with the
other thresholds a bit.

Ideally, node16 and node32 would have the same code with a different
loop count (1 or 2). More generally, there is too much duplication of
code (noted by Andres in his PoC), and there are many variable names
with the node size embedded. This is a bit tricky to make more
general, so we don't need to try it yet, but ideally we would have
something similar to:

switch (node->kind) // todo: inspect tagged pointer
{
case RADIX_TREE_NODE_KIND_4:
idx = node_search_eq(node, chunk, 4);
do_action(node, idx, 4, ...);
break;
case RADIX_TREE_NODE_KIND_32:
idx = node_search_eq(node, chunk, 32);
do_action(node, idx, 32, ...);
...
}

static pg_alwaysinline void
node_search_eq(radix_tree_node node, uint8 chunk, int16 node_fanout)
{
if (node_fanout <= SIMPLE_LOOP_THRESHOLD)
// do simple loop with (node_simple *) node;
else if (node_fanout <= VECTORIZED_LOOP_THRESHOLD)
// do vectorized loop where available with (node_vec *) node;
...
}

...and let the compiler do loop unrolling and branch removal. Not sure
how difficult this is to do, but something to think about.

Another thought: for non-x86 platforms, the SIMD nodes degenerate to
"simple loop", and looping over up to 32 elements is not great
(although possibly okay). We could do binary search, but that has bad
branch prediction.

--
John Naylor
EDB: http://www.enterprisedb.com

Attachment Content-Type Size
v3-radix-review-diff-20220627.txt text/plain 6.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dagfinn Ilmari Mannsåker 2022-06-27 11:22:42 Re: Logging query parmeters in auto_explain
Previous Message Bharath Rupireddy 2022-06-27 10:52:31 Re: Allow pageinspect's bt_page_stats function to return a set of rows instead of a single row