Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2023-06-13 05:46:57
Message-ID: CAD21AoDQw3DTQkq_VU_1B5qytXwtDRjrvB5EOHrfAuBZanah0w@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jun 6, 2023 at 2:13 PM John Naylor <john(dot)naylor(at)enterprisedb(dot)com> wrote:
>
> On Mon, Jun 5, 2023 at 5:32 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > > Sometime in the not-too-distant future, I will start a new thread focusing on bitmap heap scan, but for now, I just want to share some progress on making the radix tree usable not only for that, but hopefully a wider range of applications, while making the code simpler and the binary smaller. The attached patches are incomplete (e.g. no iteration) and quite a bit messy, so tar'd and gzip'd for the curious (should apply on top of v32 0001-03 + 0007-09 ).
> > >
> >
> > Thank you for making progress on this. I agree with these directions
> > overall. I have some comments and questions:
>
> Glad to hear it and thanks for looking!
>
> > > * Other nodes will follow in due time, but only after I figure out how to do it nicely (ideas welcome!) -- currently node32's two size classes work fine for growing, but the code should be simplified before extending to other cases.)
> >
> > Within the size class, we just alloc a new node of lower size class
> > and do memcpy(). I guess it will be almost same as what we do for
> > growing.
>
> Oh, the memcpy part is great, very simple. I mean the (compile-time) "class info" table lookups are a bit awkward. I'm thinking the hard-coded numbers like this:
>
> .fanout = 3,
> .inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
>
> ...may be better with a #defined symbol that can also be used elsewhere.

FWIW, exposing these definitions would also be useful for testing, since we
could refer to them in the regression tests.
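For example, something like this (the symbol name is made up):

#define RT_FANOUT_3 3

    .fanout = RT_FANOUT_3,
    .inner_size = sizeof(RT_NODE_INNER_3) + RT_FANOUT_3 * sizeof(RT_PTR_ALLOC),

and a test could then insert exactly RT_FANOUT_3 keys to exercise the
boundary where the node has to grow into the next size class.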

>
> > I don't think
> > shrinking class-3 to class-1 makes sense.
>
> Agreed. The smallest kind should just be freed when empty.
>
> > > Limited support for "combined pointer-value slots". At compile-time, choose either that or "single-value leaves" based on the size of the value type template parameter. Values that are pointer-sized or less can fit in the last-level child slots of nominal "inner nodes" without duplicated leaf-node code. Node256 now must act like the previous 'node256 leaf', since zero is a valid value. Aside from that, this was a small change.
> >
> > Yes, but it also means that we use pointer-sized value anyway even if
> > the value size is less than that, which wastes the memory, no?
>
> At a low-level, that makes sense, but I've found an interesting global effect showing the opposite: _less_ memory, which may compensate:
>
> psql -c "select * from bench_search_random_nodes(1*1000*1000)"
> num_keys = 992660
>
> (using a low enough number that the experimental change n125->n63 doesn't affect anything)
> height = 4, n3 = 375258, n15 = 137490, n32 = 0, n63 = 0, n256 = 1025
>
> v31:
> mem_allocated | load_ms | search_ms
> ---------------+---------+-----------
> 47800768 | 253 | 134
>
> (unreleased code "similar" to v33, but among other things restores the separate "extend down" function)
> mem_allocated | load_ms | search_ms
> ---------------+---------+-----------
> 42926048 | 221 | 127
>
> I'd need to make sure, but apparently just going from 6 non-empty memory contexts to 3 (remember all values are embedded here) reduces memory fragmentation significantly in this test. (That should also serve as a demonstration that additional size classes have both runtime costs as well as benefits. We need to have a balance.)

Interesting. The result would probably vary if we changed the slab
block sizes. I'd like to experiment with that once the code is available.

>
> So, I'm inclined to think the only reason to prefer "multi-value leaves" is if 1) the value type is _bigger_ than a pointer 2) there is no convenient abbreviation (like tid bitmaps have) and 3) the use case really needs to avoid another memory access. Under those circumstances, though, the new code plus lazy expansion etc might suit and be easier to maintain.

Indeed.
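IIUC the choice then boils down to a compile-time size check, roughly like
this (a self-contained illustration with made-up names, not the actual
template symbols):

#include <stdint.h>
#include <stdlib.h>

/*
 * Illustration only: if the value type fits in a child slot, the value is
 * stored in the slot itself; otherwise the slot holds a pointer to a
 * single-value leaf.  The check is a compile-time constant, so the unused
 * branch is dropped by the compiler.
 */
typedef uintptr_t rt_slot;      /* a child slot of an inner node */
typedef uint64_t  rt_value;     /* stand-in for the value type parameter */

#define RT_VALUES_EMBEDDED  (sizeof(rt_value) <= sizeof(rt_slot))

static inline rt_slot
rt_value_to_slot(rt_value v)
{
    if (RT_VALUES_EMBEDDED)
        return (rt_slot) v;                     /* embedded in the slot */

    rt_value *leaf = malloc(sizeof(rt_value));  /* single-value leaf */
    *leaf = v;
    return (rt_slot) leaf;
}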

>
> > > What I've shared here could work (in principle, since it uses uint64 values) for tidstore, possibly faster (untested) because of better code density, but as mentioned I want to shoot for higher. For tidbitmap.c, I want to extend this idea and branch at run-time on a per-value basis, so that a page-table entry that fits in a pointer can go there, and if not, it'll be a full leaf. (This technique enables more flexibility in lossifying pages as well.) Run-time info will require e.g. an additional bit per slot. Since the node header is now 3 bytes, we can spare one more byte in the node3 case. In addition, we can and should also bump it back up to node4, still keeping the metadata within 8 bytes (no struct padding).
> >
> > Sounds good.
>
> The additional bit per slot would require per-node logic and additional branches, which is not great. I'm now thinking a much easier way to get there is to give up (at least for now) on promising that "run-time embeddable values" can use the full pointer-size (unlike value types found embeddable at compile-time). Reserving the lowest pointer bit for a tag "value or pointer-to-leaf" would have a much smaller code footprint.

Do you mean that we would have to ensure the value never sets the lowest
bit? Or is this an optimization specific to TIDStore?
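In other words, something like the following (a sketch of one possible
scheme, with made-up names), where an embedded value gives up one bit:

#include <stdint.h>

/*
 * Sketch only: leaf pointers are at least 2-byte aligned, so their lowest
 * bit is always 0; an embedded value sets the lowest bit and keeps its
 * payload in the remaining bits, i.e. it can no longer use the full
 * pointer width.
 */
typedef uintptr_t rt_slot;

#define SLOT_IS_VALUE(slot)   (((slot) & 1) != 0)
#define VALUE_TO_SLOT(val)    ((((rt_slot) (val)) << 1) | 1)
#define SLOT_TO_VALUE(slot)   ((slot) >> 1)
#define LEAF_TO_SLOT(leafp)   ((rt_slot) (leafp))
#define SLOT_TO_LEAF(slot)    ((void *) (slot))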

> In addition, without a new bitmap, the smallest node can actually be up to a node5 with no struct padding, with a node2 as a subclass. (Those numbers coincidentally were also one scenario in the paper, when calculating worst-case memory usage). That's worth considering.

Agreed.
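To make sure I understand the layout: something like the following (field
names are made up), where the 3-byte header plus five chunk bytes fill
exactly 8 bytes, leaving no padding before the pointer-aligned child slots:

#include <stdint.h>

typedef struct rt_node5
{
    uint8_t kind;       /* node kind / size class */
    uint8_t fanout;     /* capacity of this class (5) */
    uint8_t count;      /* children currently in use */
    uint8_t chunks[5];  /* 3 + 5 = 8 bytes of metadata */
    /* child slots (pointer-sized, pointer-aligned) follow at offset 8 */
} rt_node5;

_Static_assert(sizeof(rt_node5) == 8, "no struct padding in the metadata");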

FWIW, please let me know if there are any experiments I can help with.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
