Re: [PoC] Improve dead tuple storage for lazy vacuum

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2023-06-06 05:13:36
Message-ID: CAFBsxsEJ7PhhE6dKXtVm3YnT=T7wgNWbOEEXRXm5wDMyKBG2Yw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jun 5, 2023 at 5:32 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
wrote:
>
> > Sometime in the not-too-distant future, I will start a new thread
focusing on bitmap heap scan, but for now, I just want to share some
progress on making the radix tree usable not only for that, but hopefully a
wider range of applications, while making the code simpler and the binary
smaller. The attached patches are incomplete (e.g. no iteration) and quite
a bit messy, so tar'd and gzip'd for the curious (should apply on top of
v32 0001-03 + 0007-09 ).
> >
>
> Thank you for making progress on this. I agree with these directions
> overall. I have some comments and questions:

Glad to hear it and thanks for looking!

> > * Other nodes will follow in due time, but only after I figure out how
to do it nicely (ideas welcome!) -- currently node32's two size classes
work fine for growing, but the code should be simplified before extending
to other cases.)
>
> Within the size class, we just alloc a new node of lower size class
> and do memcpy(). I guess it will be almost same as what we do for
> growing.

Oh, the memcpy part is great, very simple. I mean the (compile-time) "class
info" table lookups are a bit awkward. I'm thinking the hard-coded numbers
like this:

.fanout = 3,
.inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),

...may be better with a #defined symbol that can also be used elsewhere.

> I don't think
> shrinking class-3 to class-1 makes sense.

Agreed. The smallest kind should just be freed when empty.

> > Limited support for "combined pointer-value slots". At compile-time,
choose either that or "single-value leaves" based on the size of the value
type template parameter. Values that are pointer-sized or less can fit in
the last-level child slots of nominal "inner nodes" without duplicated
leaf-node code. Node256 now must act like the previous 'node256 leaf',
since zero is a valid value. Aside from that, this was a small change.
>
> Yes, but it also means that we use pointer-sized value anyway even if
> the value size is less than that, which wastes the memory, no?

At a low-level, that makes sense, but I've found an interesting global
effect showing the opposite: _less_ memory, which may compensate:

psql -c "select * from bench_search_random_nodes(1*1000*1000)"
num_keys = 992660

(using a low enough number that the experimental change n125->n63 doesn't
affect anything)
height = 4, n3 = 375258, n15 = 137490, n32 = 0, n63 = 0, n256 = 1025

v31:
mem_allocated | load_ms | search_ms
---------------+---------+-----------
47800768 | 253 | 134

(unreleased code "similar" to v33, but among other things restores the
separate "extend down" function)
mem_allocated | load_ms | search_ms
---------------+---------+-----------
42926048 | 221 | 127

I'd need to make sure, but apparently just going from 6 non-empty memory
contexts to 3 (remember all values are embedded here) reduces memory
fragmentation significantly in this test. (That should also serve as a
demonstration that additional size classes have both runtime costs as well
as benefits. We need to have a balance.)

So, I'm inclined to think the only reason to prefer "multi-value leaves" is
if 1) the value type is _bigger_ than a pointer 2) there is no convenient
abbreviation (like tid bitmaps have) and 3) the use case really needs to
avoid another memory access. Under those circumstances, though, the new
code plus lazy expansion etc might suit and be easier to maintain. That
said, I've mostly left alone the "leaf" types and functions, as well as
added some detritus like "const bool = false;". It would look a *lot* nicer
if we gave up on multi-value leaves entirely, but there's no rush and I
don't want to close that door entirely just yet.

> > What I've shared here could work (in principal, since it uses uint64
values) for tidstore, possibly faster (untested) because of better code
density, but as mentioned I want to shoot for higher. For tidbitmap.c, I
want to extend this idea and branch at run-time on a per-value basis, so
that a page-table entry that fits in a pointer can go there, and if not,
it'll be a full leaf. (This technique enables more flexibility in
lossifying pages as well.) Run-time info will require e.g. an additional
bit per slot. Since the node header is now 3 bytes, we can spare one more
byte in the node3 case. In addition, we can and should also bump it back up
to node4, still keeping the metadata within 8 bytes (no struct padding).
>
> Sounds good.

The additional bit per slot would require per-node logic and additional
branches, which is not great. I'm now thinking a much easier way to get
there is to give up (at least for now) on promising that "run-time
embeddable values" can use the full pointer-size (unlike value types found
embeddable at compile-time). Reserving the lowest pointer bit for a tag
"value or pointer-to-leaf" would have a much smaller code footprint. That
also has a curious side-effect for TID offsets: They are one-based so
reserving the zero bit would actually simplify things: getting rid of the
+1/-1 logic when converting bits to/from offsets.

In addition, without a new bitmap, the smallest node can actually be up to
a node5 with no struct padding, with a node2 as a subclass. (Those numbers
coincidentally were also one scenario in the paper, when calculating
worst-case memory usage). That's worth considering.

> > I've started in this patchset to refer to the node kinds as
"4/16/48/256", regardless of their actual fanout.

> If we want to use the node kinds used in the paper, I think we should
> change the number in RT_NODE_KIND_X too.

Oh absolutely, this is nowhere near ready for cosmetic review :-)

--
John Naylor
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Wei Wang (Fujitsu) 2023-06-06 06:01:33 RE: Support logical replication of DDLs
Previous Message Bharath Rupireddy 2023-06-06 04:30:00 Re: Implement generalized sub routine find_in_log for tap test