Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <johncnaylorls(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2024-02-19 02:01:45
Message-ID: CAD21AoD3rjZ7yaK0OjsoUEKpiZaQV0f2S+sOYM5q6eoetkN2Pw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Feb 16, 2024 at 12:41 PM John Naylor <johncnaylorls(at)gmail(dot)com> wrote:
>
> On Fri, Feb 16, 2024 at 10:05 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > v61-0007: Runtime-embeddable tids -- Optional for v17, but should
> > > reduce memory regressions, so should be considered. Up to 3 tids can
> > > be stored in the last level child pointer. It's not polished, but I'll
> > > only proceed with that if we think we need this. "flags" iis called
> > > that because it could hold tidbitmap.c booleans (recheck, lossy) in
> > > the future, in addition to reserving space for the pointer tag. Note:
> > > I hacked the tests to only have 2 offsets per block to demo, but of
> > > course both paths should be tested.
> >
> > Interesting. I've run the same benchmark tests we did[1][2] (the
> > median of 3 runs):
> >
> > monotonically ordered int column index:
> >
> > master: system usage: CPU: user: 14.91 s, system: 0.80 s, elapsed: 15.73 s
> > v-59: system usage: CPU: user: 9.67 s, system: 0.81 s, elapsed: 10.50 s
> > v-62: system usage: CPU: user: 1.94 s, system: 0.69 s, elapsed: 2.64 s
>
> Hmm, that's strange -- this test is intended to delete all records
> from the last 20% of the blocks, so I wouldn't expect any improvement
> here, only in the sparse case. Maybe something is wrong. All the more
> reason to put it off...

Okay, let's dig it deeper later.

>
> > I'm happy to see a huge improvement. While it's really fascinating to
> > me, I'm concerned about the time left until the feature freeze. We
> > need to polish both tidstore and vacuum integration patches in 5
> > weeks. Personally I'd like to have it as a separate patch for now, and
> > focus on completing the main three patches since we might face some
> > issues after pushing these patches. I think with 0007 patch it's a big
> > win but it's still a win even without 0007 patch.
>
> Agreed to not consider it for initial commit. I'll hold on to it for
> some future time.
>
> > > 2. Management of memory contexts. It's pretty verbose and messy. I
> > > think the abstraction could be better:
> > > A: tidstore currently passes CurrentMemoryContext to RT_CREATE, so we
> > > can't destroy or reset it. That means we have to do a lot of manual
> > > work.
> > > B: Passing "max_bytes" to the radix tree was my idea, I believe, but
> > > it seems the wrong responsibility. Not all uses will have a
> > > work_mem-type limit, I'm guessing. We only use it for limiting the max
> > > block size, and aset's default 8MB is already plenty small for
> > > vacuum's large limit anyway. tidbitmap.c's limit is work_mem, so
> > > smaller, and there it makes sense to limit the max blocksize this way.
> > > C: The context for values has complex #ifdefs based on the value
> > > length/varlen, but it's both too much and not enough. If we get a bump
> > > context, how would we shoehorn that in for values for vacuum but not
> > > for tidbitmap?
> > >
> > > Here's an idea: Have vacuum (or tidbitmap etc.) pass a context to
> > > TidStoreCreate(), and then to RT_CREATE. That context will contain the
> > > values (for local mem), and the node slabs will be children of the
> > > value context. That way, measuring memory usage and free-ing can just
> > > call with this parent context, and let recursion handle the rest.
> > > Perhaps the passed context can also hold the radix-tree struct, but
> > > I'm not sure since I haven't tried it. What do you think?
> >
> > If I understand your idea correctly, RT_CREATE() creates the context
> > for values as a child of the passed context and the node slabs as
> > children of the value context. That way, measuring memory usage can
> > just call with the value context. It sounds like a good idea. But it
> > was not clear to me how to address point B and C.
>
> For B & C, vacuum would create a context to pass to TidStoreCreate,
> and it wouldn't need to bother changing max block size. RT_CREATE
> would use that directly for leaves (if any), and would only create
> child slab contexts under it. It would not need to know about
> max_bytes. Modifyng your diagram a bit, something like:
>
> - caller-supplied radix tree memory context (the 3 structs -- and
> leaves, if any) (aset (or future bump?))
> - node slab contexts
>
> This might only be workable with aset, if we need to individually free
> the structs. (I haven't studied this, it was a recent idea)
> It's simpler, because with small fixed length values, we don't need to
> detect that and avoid creating a leaf context. All leaves would live
> in the same context as the structs.

Thank you for the explanation.

I think that vacuum and tidbitmap (and future users) would end up
having the same max block size calculation. And it seems slightly odd
layering to me that max-block-size-specified context is created on
vacuum (or tidbitmap) layer, a varlen-value radix tree is created by
tidstore layer, and the passed context is used for leaves (if
varlen-value is used) on radix tree layer. Another idea is to create a
max-block-size-specified context on the tidstore layer. That is,
vacuum and tidbitmap pass a work_mem and a flag indicating whether the
tidstore can use the bump context, and tidstore creates a (aset of
bump) memory context with the calculated max block size and passes it
to the radix tree.

As for using the bump memory context, I feel that we need to store
iterator struct in aset context at least as it can be individually
freed and re-created. Or it might not be necessary to allocate the
iterator struct in the same context as radix tree.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Japin Li 2024-02-19 02:26:06 Re: Thoughts about NUM_BUFFER_PARTITIONS
Previous Message Andrei Lepikhov 2024-02-19 01:37:54 Re: Reducing memory consumed by RestrictInfo list translations in partitionwise join planning