Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <johncnaylorls(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2024-03-01 08:00:21
Message-ID: CAD21AoDfTR8e0O1mXPY_wG-35x+f=dfFkD68k=pOEu=UdK0jGQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Feb 29, 2024 at 8:43 PM John Naylor <johncnaylorls(at)gmail(dot)com> wrote:
>
> On Tue, Feb 20, 2024 at 1:59 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> > - v63-0008 patch fixes a bug in tidstore.
>
> - page->nwords = wordnum + 1;
> - Assert(page->nwords = WORDS_PER_PAGE(offsets[num_offsets - 1]));
> + page->nwords = wordnum;
> + Assert(page->nwords == WORDS_PER_PAGE(offsets[num_offsets - 1]));
>
> Yikes, I'm guessing this failed in a non-assert builds? I wonder why
> my compiler didn't yell at me... Have you tried a tidstore-debug build
> without asserts?

Yes. I didn't get any failures.

>
> > - v63-0009 patch is a draft idea of cleanup memory context handling.
>
> Thanks, looks pretty good!
>
> + ts->rt_context = AllocSetContextCreate(CurrentMemoryContext,
> + "tidstore storage",
>
> "tidstore storage" sounds a bit strange -- maybe look at some other
> context names for ideas.

Agreed. How about "tidstore's radix tree"?

>
> - leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx, allocsize);
> + leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx != NULL
> + ? tree->leaf_ctx
> + : tree->context, allocsize);
>
> Instead of branching here, can we copy "context" to "leaf_ctx" when
> necessary (those names should look more like eachother, btw)? I think
> that means anything not covered by this case:
>
> +#ifndef RT_VARLEN_VALUE_SIZE
> + if (sizeof(RT_VALUE_TYPE) > sizeof(RT_PTR_ALLOC))
> + tree->leaf_ctx = SlabContextCreate(ctx,
> + RT_STR(RT_PREFIX) "radix_tree leaf contex",
> + RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
> + sizeof(RT_VALUE_TYPE));
> +#endif /* !RT_VARLEN_VALUE_SIZE */
>
> ...also, we should document why we're using slab here. On that, I
> don't recall why we are? We've never had a fixed-length type test case
> on 64-bit, so it wasn't because it won through benchmarking. It seems
> a hold-over from the days of "multi-value leaves". Is it to avoid the
> possibility of space wastage with non-power-of-two size types?

Yes, it matches my understanding.

>
> For this stanza that remains unchanged:
>
> for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
> {
> MemoryContextDelete(tree->node_slabs[i]);
> }
>
> if (tree->leaf_ctx)
> {
> MemoryContextDelete(tree->leaf_ctx);
> }
>
> ...is there a reason we can't just delete tree->ctx, and let that
> recursively delete child contexts?

I thought that considering the RT_CREATE doesn't create its own memory
context but just uses the passed context, it might be a bit unusable
to delete the passed context in the radix tree code. For example, if a
caller creates a radix tree (or tidstore) on a memory context and
wants to recreate it again and again, he also needs to re-create the
memory context together. It might be okay if we leave comments on
RT_CREATE as a side effect, though. This is the same reason why we
don't destroy tree->dsa in RT_FREE(). And, as for RT_FREE_RECURSE(),

On Fri, Mar 1, 2024 at 1:15 PM John Naylor <johncnaylorls(at)gmail(dot)com> wrote:
>
> I'm looking at RT_FREE_RECURSE again (only used for DSA memory), and
> I'm not convinced it's freeing all the memory. It's been many months
> since we discussed this last, but IIRC we cannot just tell DSA to free
> all its segments, right?

Right.

> Is there currently anything preventing us
> from destroying the whole DSA area at once?

When it comes to tidstore and parallel vacuum, we initialize DSA and
create a tidstore there at the beginning of the lazy vacuum, and
recreate the tidstore again after the heap vacuum. So I don't want to
destroy the whole DSA when destroying the tidstore. Otherwise, we will
need to create a new DSA and pass its handle somehow.

Probably the bitmap scan case is similar. Given that bitmap scan
(re)creates tidbitmap in the same DSA multiple times, it's better to
avoid freeing the whole DSA.

>
> + /* The last level node has pointers to values */
> + if (shift == 0)
> + {
> + dsa_free(tree->dsa, ptr);
> + return;
> + }
>
> IIUC, this doesn't actually free leaves, it only frees the last-level
> node. And, this function is unaware of whether children could be
> embedded values. I'm thinking we need to get rid of the above
> pre-check and instead, each node kind to have something like (e.g.
> node4):
>
> RT_PTR_ALLOC child = n4->children[i];
>
> if (shift > 0)
> RT_FREE_RECURSE(tree, child, shift - RT_SPAN);
> else if (!RT_CHILDPTR_IS_VALUE(child))
> dsa_free(tree->dsa, child);
>
> ...or am I missing something?

You're not missing anything. RT_FREE_RECURSE() has not been updated
for a long time. If we still need to use RT_FREE_RECURSE(), it should
be updated.

> Thirdly, cosmetic: With the introduction of single-value leaves, it
> seems we should do s/RT_NODE_PTR/RT_CHILD_PTR/ -- what do you think?

Agreed.

On Fri, Mar 1, 2024 at 3:58 PM John Naylor <johncnaylorls(at)gmail(dot)com> wrote:
>
> I wrote:
>
> > Secondly, I thought about my recent work to skip checking if we first
> > need to create a root node, and that has a harmless (for vacuum at
> > least) but slightly untidy behavior: When RT_SET is first called, and
> > the key is bigger than 255, new nodes will go on top of the root node.
> > These have chunk '0'. If all subsequent keys are big enough, the
> > orginal root node will stay empty. If all keys are deleted, there will
> > be a chain of empty nodes remaining. Again, I believe this is
> > harmless, but to make tidy, it should easy to teach RT_EXTEND_UP to
> > call out to RT_EXTEND_DOWN if it finds the tree is empty. I can work
> > on this, but likely not today.
>
> This turns out to be a lot trickier than it looked, so it seems best
> to allow a trivial amount of waste, as long as it's documented
> somewhere. It also wouldn't be terrible to re-add those branches,
> since they're highly predictable.
>
> I just noticed there are a lot of unused function parameters
> (referring to parent slots) leftover from a few weeks ago. Those are
> removed in v64-0009. 0010 makes the obvious name change in those
> remaining to "parent_slot". 0011 is a simplification in two places
> regarding reserving slots. This should be a bit easier to read and
> possibly makes it easier on the compiler.

Thank you for the updates. I've briefly looked at these changes and
they look good to me. I'm going to review them again in depth.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jakub Wartak 2024-03-01 08:20:30 Re: index prefetching
Previous Message Zhijie Hou (Fujitsu) 2024-03-01 07:21:18 RE: Synchronizing slots from primary to standby