Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <johncnaylorls(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2024-01-09 02:40:03
Message-ID: CAD21AoC5mRbnHkRiqpxx0bNMcVPL-vtMKvcqfnc_+y37JJyztQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jan 3, 2024 at 11:10 PM John Naylor <johncnaylorls(at)gmail(dot)com> wrote:
>
> On Tue, Jan 2, 2024 at 8:01 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> > I agree that we expose RT_LOCK_* functions and have tidstore use them,
> > but am not sure the if (TidStoreIsShared(ts) LWLockAcquire(..., ...)"
> > calls part. I think that even if we expose them, we will still need to
> > do something like "if (TidStoreIsShared(ts))
> > shared_rt_lock_share(ts->tree.shared)", no?
>
> I'll come back to this topic separately.
>
> > I've attached a new patch set. From v47 patch, I've merged your
> > changes for radix tree, and split the vacuum integration patch into 3
> > patches: simply replaces VacDeadItems with TidsTore (0007 patch), and
> > use a simple TID array for one-pass strategy (0008 patch), and replace
> > has_lpdead_items with "num_offsets > 0" (0009 patch), while
> > incorporating your review comments on the vacuum integration patch
>
> Nice!
>
> > (sorry for making it difficult to see the changes from v47 patch).
>
> It's actually pretty clear. I just have a couple comments before
> sharing my latest cleanups:
>
> (diff'ing between v47 and v48):
>
> -- /*
> - * In the shared case, TidStoreControl and radix_tree are backed by the
> - * same DSA area and rt_memory_usage() returns the value including both.
> - * So we don't need to add the size of TidStoreControl separately.
> - */
> if (TidStoreIsShared(ts))
> - return sizeof(TidStore) +
> shared_rt_memory_usage(ts->tree.shared);
> + rt_mem = shared_rt_memory_usage(ts->tree.shared);
> + else
> + rt_mem = local_rt_memory_usage(ts->tree.local);
>
> - return sizeof(TidStore) + sizeof(TidStore) +
> local_rt_memory_usage(ts->tree.local);
> + return sizeof(TidStore) + sizeof(TidStoreControl) + rt_mem;
>
> Upthread, I meant that I don't see the need to include the size of
> these structs *at all*. They're tiny, and the blocks/segments will
> almost certainly have some empty space counted in the total anyway.
> The returned size is already overestimated, so this extra code is just
> a distraction.

Agreed.

>
> - if (result->num_offsets + bmw_popcount(w) > result->max_offset)
> + if (result->num_offsets + (sizeof(bitmapword) * BITS_PER_BITMAPWORD)
> >= result->max_offset)
>
> I believe this math is wrong. We care about "result->num_offsets +
> BITS_PER_BITMAPWORD", right?
> Also, it seems if the condition evaluates to equal, we still have
> enough space, in which case ">" the max is the right condition.

Oops, you're right. Fixed.

>
> - if (off < 1 || off > MAX_TUPLES_PER_PAGE)
> + if (off < 1 || off > MaxOffsetNumber)
>
> This can now use OffsetNumberIsValid().

Fixed.

>
> > 0013 to 0015 patches are also updates from v47 patch.
>
> > I'm thinking that we should change the order of the patches so that
> > tidstore patch requires the patch for changing DSA segment sizes. That
> > way, we can remove the complex max memory calculation part that we no
> > longer use from the tidstore patch.
>
> I don't think there is any reason to have those calculations at all at
> this point. Every patch in every version should at least *work
> correctly*, without kludging m_w_m and without constraining max
> segment size. I'm fine with the latter remaining in its own thread,
> and I hope we can consider it an enhancement that respects the admin's
> configured limits more effectively, and not a pre-requisite for not
> breaking. I *think* we're there now, but it's hard to tell since 0015
> was at the very end. As I said recently, if something still fails, I'd
> like to know why. So for v49, I took the liberty of removing the DSA
> max segment patches for now, and squashing v48-0015.

Fair enough.

>
> In addition for v49, I have quite a few cleanups:
>
> 0001 - This hasn't been touched in a very long time, but I ran
> pgindent and clarified a comment
> 0002 - We no longer need to isolate the rightmost bit anywhere, so
> removed that part and revised the commit message accordingly.

Thanks.

>
> radix tree:
> 0003 - v48 plus squashed v48-0013
> 0004 - Removed or adjusted WIP, FIXME, TODO items. Some were outdated,
> and I fixed most of the rest.
> 0005 - Remove the RT_PTR_LOCAL macro, since it's not really useful anymore.
> 0006 - RT_FREE_LEAF only needs the allocated pointer, so pass that. A
> bit simpler.
> 0007 - Uses the same idea from a previous cleanup of RT_SET, for RT_DELETE.
> 0008 - Removes a holdover from the multi-value leaves era.
> 0009 - It occurred to me that we need to have unique names for memory
> contexts for different instantiations of the template. This is one way
> to do it, by using the configured RT_PREFIX in the context name. I
> also took an extra step to make the size class fanout show up
> correctly on different platforms, but that's probably overkill and
> undesirable, and I'll probably use only the class name next time.
> 0010/11 - Make the array functions less surprising and with more
> informative names.
> 0012 - Restore a useful technique from Andres's prototype. This part
> has been slow for a long time, so much that it showed up in a profile
> where this path wasn't even taken much.

These changes look good to me. I've squashed them.

In addition, I've made some changes and cleanups:

0010 - address the above review comments.
0011 - simplify the radix tree iteration code. I hope it makes the
code clear and readable. Also I removed RT_UPDATE_ITER_STACK().
0012 - fix a typo
0013 - In RT_SHMEM case, we use SIZEOF_VOID_P for
RT_VALUE_IS_EMBEDDABLE check, but I think it's not correct. Because
DSA has its own pointer size, SIZEOF_DSA_POINTER, it could be 4 bytes
even if SIZEOF_VOID_P is 8 bytes, for example in a case where
!defined(PG_HAVE_ATOMIC_U64_SUPPORT). Please refer to dsa.h for
details.
0014 - cleanup RT_VERIFY code.
0015 - change and cleanup RT_DUMP_NODE(). Now it dumps only one node
and no longer supports dumping nodes recursively.
0016 - remove RT_DUMP_SEARCH() and RT_DUMP(). These seem no longer necessary.
0017 - MOve RT_DUMP_NODE to the debug function section, close to RT_STATS.
0018 - Fix a printf format in RT_STATS().

BTW, now that the inner and leaf nodes use the same structure, do we
still need RT_NODE_BASE_XXX types? Most places where we use
RT_NODE_BASE_XXX types can be replaced with RT_NODE_XXX types.
Exceptions are RT_FANOUT_XX calculations:

#if SIZEOF_VOID_P < 8
#define RT_FANOUT_16_LO ((96 - sizeof(RT_NODE_BASE_16)) / sizeof(RT_PTR_ALLOC))
#define RT_FANOUT_48 ((512 - sizeof(RT_NODE_BASE_48)) / sizeof(RT_PTR_ALLOC))
#else
#define RT_FANOUT_16_LO ((160 - sizeof(RT_NODE_BASE_16)) / sizeof(RT_PTR_ALLOC))
#define RT_FANOUT_48 ((768 - sizeof(RT_NODE_BASE_48)) / sizeof(RT_PTR_ALLOC))
#endif /* SIZEOF_VOID_P < 8 */

But I think we can replace them with offsetof(RT_NODE_16, children) etc.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v50-ART.tar.gz application/x-gzip 64.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2024-01-09 03:05:05 Re: Random pg_upgrade test failure on drongo
Previous Message John Naylor 2024-01-09 02:20:09 Re: add AVX2 support to simd.h