Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2023-06-05 10:31:41
Message-ID: CAD21AoAWhHSDeNdVEmBpOcVLMav0mz4AJfMN2wURc+h85q7PUQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Tue, May 23, 2023 at 7:17 PM John Naylor
<john(dot)naylor(at)enterprisedb(dot)com> wrote:
>
> I wrote:
> > the current insert/delete paths are quite complex. Using bitmap heap scan as a motivating use case, I hope to refocus complexity to where it's most needed, and aggressively simplify where possible.
>
> Sometime in the not-too-distant future, I will start a new thread focusing on bitmap heap scan, but for now, I just want to share some progress on making the radix tree usable not only for that, but hopefully a wider range of applications, while making the code simpler and the binary smaller. The attached patches are incomplete (e.g. no iteration) and quite a bit messy, so tar'd and gzip'd for the curious (should apply on top of v32 0001-03 + 0007-09 ).
>

Thank you for making progress on this. I agree with these directions
overall. I have some comments and questions:

> - With the latter in mind, searching the child within a node now returns the address of the slot. This allows the same interface whether the slot contains a child pointer or a value.

Probably we can apply similar changes to the iteration as well.

> * Other nodes will follow in due time, but only after I figure out how to do it nicely (ideas welcome!) -- currently node32's two size classes work fine for growing, but the code should be simplified before extending to other cases.)

Within the size class, we just alloc a new node of lower size class
and do memcpy(). I guess it will be almost same as what we do for
growing. It might be a good idea to support node shrinking within the
size class for node32 (and node125 if we support). I don't think
shrinking class-3 to class-1 makes sense.

>
> Limited support for "combined pointer-value slots". At compile-time, choose either that or "single-value leaves" based on the size of the value type template parameter. Values that are pointer-sized or less can fit in the last-level child slots of nominal "inner nodes" without duplicated leaf-node code. Node256 now must act like the previous 'node256 leaf', since zero is a valid value. Aside from that, this was a small change.

Yes, but it also means that we use pointer-sized value anyway even if
the value size is less than that, which wastes the memory, no?

>
> What I've shared here could work (in principal, since it uses uint64 values) for tidstore, possibly faster (untested) because of better code density, but as mentioned I want to shoot for higher. For tidbitmap.c, I want to extend this idea and branch at run-time on a per-value basis, so that a page-table entry that fits in a pointer can go there, and if not, it'll be a full leaf. (This technique enables more flexibility in lossifying pages as well.) Run-time info will require e.g. an additional bit per slot. Since the node header is now 3 bytes, we can spare one more byte in the node3 case. In addition, we can and should also bump it back up to node4, still keeping the metadata within 8 bytes (no struct padding).

Sounds good.

> I've started in this patchset to refer to the node kinds as "4/16/48/256", regardless of their actual fanout. This is for readability (by matching the language in the paper) and maintainability (should *not* ever change again). The size classes (including multiple classes per kind) could be determined by macros and #ifdef's. For example, in non-SIMD architectures, it's likely slow to search an array of 32 key chunks, so in that case the compiler should choose size classes similar to these four nominal kinds.

If we want to use the node kinds used in the paper, I think we should
change the number in RT_NODE_KIND_X too. Otherwise, it would be
confusing when reading the code without referring to the paper.
Particularly, this part is very confusing:

case RT_NODE_KIND_3:
RT_ADD_CHILD_4(tree, ref, node, chunk, child);
break;
case RT_NODE_KIND_32:
RT_ADD_CHILD_16(tree, ref, node, chunk, child);
break;
case RT_NODE_KIND_125:
RT_ADD_CHILD_48(tree, ref, node, chunk, child);
break;
case RT_NODE_KIND_256:
RT_ADD_CHILD_256(tree, ref, node, chunk, child);
break;

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ranier Vilela 2023-06-05 11:06:26 Re: Avoid unncessary always true test (src/backend/storage/buffer/bufmgr.c)
Previous Message shveta malik 2023-06-05 09:30:01 Re: Support logical replication of DDLs