Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <johncnaylorls(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2024-02-16 03:04:38
Message-ID: CAD21AoBci3Hujzijubomo1tdwH3XtQ9F89cTNQ4bsQijOmqnEw@mail.gmail.com

On Thu, Feb 15, 2024 at 8:26 PM John Naylor <johncnaylorls(at)gmail(dot)com> wrote:
>
> On Thu, Feb 15, 2024 at 10:21 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Sat, Feb 10, 2024 at 9:29 PM John Naylor <johncnaylorls(at)gmail(dot)com> wrote:
>
> > I've also run the same scripts in my environment just in case and got
> > similar results:
>
> Thanks for testing, looks good as well.
>
> > > There are still some micro-benchmarks we could do on tidstore, and
> > > it'd be good to find out worse-case memory use (1 dead tuple each on
> > > spread-out pages), but this is decent demonstration.
> >
> > I've tested a simple case where vacuum removes 33k dead tuples spread
> > about every 10 pages.
> >
> > master:
> > 198,000 bytes (=33000 * 6)
> > system usage: CPU: user: 29.49 s, system: 0.88 s, elapsed: 30.40 s
> >
> > v-59:
> > 2,834,432 bytes (reported by TidStoreMemoryUsage())
> > system usage: CPU: user: 15.96 s, system: 0.89 s, elapsed: 16.88 s
>
> The memory usage for the sparse case may be a concern, although it's
> not bad -- a multiple of something small is probably not huge in
> practice. See below for an option we have for this.
>
> > > > > I'm pretty sure there's an
> > > > > accidental memset call that crept in there, but I'm running out of
> > > > > steam today.
> > >
> > > I have just a little bit of work to add for v59:
> > >
> > > v59-0009 - set_offset_bitmap_at() will call memset if it needs to zero
> > > any bitmapwords. That can only happen if e.g. there is an offset > 128
> > > and there are none between 64 and 128, so not a huge deal but I think
> > > it's a bit nicer in this patch.
> >
> > LGTM.
>
> Okay, I've squashed this.
>
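
(As a side note for anyone following the thread: the zeroing case
being discussed amounts to something like the simplified sketch below;
this is not the actual patch code.)

#include "postgres.h"
#include "nodes/bitmapset.h"
#include "storage/off.h"

/*
 * Set the bit for "off" in a variable-length bitmap, zeroing any
 * bitmapwords that were skipped.  The memset fires only when the new
 * offset lands beyond the last word used so far, e.g. an offset > 128
 * arriving when no offset between 64 and 128 was ever set.
 */
static void
set_offset_bitmap_at(bitmapword *words, int *nwords, OffsetNumber off)
{
	int			wordnum = (off - 1) / BITS_PER_BITMAPWORD;
	int			bitnum = (off - 1) % BITS_PER_BITMAPWORD;

	if (wordnum >= *nwords)
	{
		memset(&words[*nwords], 0,
			   (wordnum - *nwords + 1) * sizeof(bitmapword));
		*nwords = wordnum + 1;
	}
	words[wordnum] |= ((bitmapword) 1) << bitnum;
}
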
> > I've drafted the commit message.
>
> Thanks, this is a good start.
>
> > I've run regression tests with valgrind and run the coverity scan, and
> > I don't see critical issues.
>
> Great!
>
> Now, I think we're in pretty good shape. There are a couple of things
> that might be objectionable, so I want to try to improve them in the
> little time we have:
>
> 1. Memory use for the sparse case. I shared an idea a few months ago
> of how runtime-embeddable values (true combined pointer-value slots)
> could work for tids. I don't think this is a must-have, but it's not a
> lot of code, and I have this working:
>
> v61-0006: Preparatory refactoring -- I think we should do this anyway,
> since the intent seems more clear to me.

The refactoring looks good to me.

> v61-0007: Runtime-embeddable tids -- Optional for v17, but should
> reduce memory regressions, so should be considered. Up to 3 tids can
> be stored in the last level child pointer. It's not polished, but I'll
> only proceed with that if we think we need this. "flags" is called
> that because it could hold tidbitmap.c booleans (recheck, lossy) in
> the future, in addition to reserving space for the pointer tag. Note:
> I hacked the tests to only have 2 offsets per block to demo, but of
> course both paths should be tested.

Interesting. I've run the same benchmark tests we did[1][2] (the
median of 3 runs):

monotonically ordered int column index:

master: system usage: CPU: user: 14.91 s, system: 0.80 s, elapsed: 15.73 s
v-59: system usage: CPU: user: 9.67 s, system: 0.81 s, elapsed: 10.50 s
v-62: system usage: CPU: user: 1.94 s, system: 0.69 s, elapsed: 2.64 s

uuid column index:

master: system usage: CPU: user: 28.37 s, system: 1.38 s, elapsed: 29.81 s
v-59: system usage: CPU: user: 14.84 s, system: 1.31 s, elapsed: 16.18 s
v-62: system usage: CPU: user: 4.06 s, system: 0.98 s, elapsed: 5.06 s

int & uuid indexes in parallel:

master: system usage: CPU: user: 15.92 s, system: 1.39 s, elapsed: 34.33 s
v-59: system usage: CPU: user: 10.92 s, system: 1.20 s, elapsed: 17.58 s
v-62: system usage: CPU: user: 2.54 s, system: 0.94 s, elapsed: 6.00 s

sparse case:

master:
198,000 bytes (=33000 * 6)
system usage: CPU: user: 29.49 s, system: 0.88 s, elapsed: 30.40 s

v-59:
2,834,432 bytes (reported by TidStoreMemoryUsage())
system usage: CPU: user: 15.96 s, system: 0.89 s, elapsed: 16.88 s

v-62:
729,088 bytes (reported by TidStoreMemoryUsage())
system usage: CPU: user: 4.63 s, system: 0.86 s, elapsed: 5.50 s

I'm happy to see a huge improvement. While it's really fascinating to
me, I'm concerned about the time left until the feature freeze: we
need to polish both the tidstore and vacuum integration patches within
five weeks. Personally, I'd like to keep 0007 as a separate patch for
now and focus on completing the three main patches, since we might
face issues after pushing them. With the 0007 patch it's a big win,
but it's still a win even without it.
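
For concreteness, here's a minimal sketch of what a runtime-embeddable
slot could look like. The names and exact bit layout below are made up
to illustrate the technique, not taken from the actual patch:

#include "postgres.h"
#include "storage/off.h"

/*
 * Hypothetical 64-bit child-pointer slot.  If the low bit (the
 * pointer tag) is set, the slot embeds up to three offsets directly
 * instead of pointing to a separately allocated leaf.
 *
 * Embedded layout (low to high bits):
 *   bits 0..7   flags (bit 0 is the tag; room for recheck/lossy later)
 *   bits 8..15  number of embedded offsets (0..3)
 *   bits 16..31 offset[0]
 *   bits 32..47 offset[1]
 *   bits 48..63 offset[2]
 */
#define SLOT_EMBEDDED_TAG	UINT64CONST(1)

static inline bool
slot_is_embedded(uint64 slot)
{
	return (slot & SLOT_EMBEDDED_TAG) != 0;
}

static inline uint64
slot_embed_offsets(const OffsetNumber *offsets, int noffsets)
{
	uint64		slot = SLOT_EMBEDDED_TAG;

	Assert(noffsets <= 3);
	slot |= (uint64) noffsets << 8;
	for (int i = 0; i < noffsets; i++)
		slot |= (uint64) offsets[i] << (16 + i * 16);
	return slot;
}

A lookup that sees the tag bit set decodes the offsets from the word
itself and never dereferences it; otherwise the slot is followed as an
ordinary pointer.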

>
> 2. Management of memory contexts. It's pretty verbose and messy. I
> think the abstraction could be better:
> A: tidstore currently passes CurrentMemoryContext to RT_CREATE, so we
> can't destroy or reset it. That means we have to do a lot of manual
> work.
> B: Passing "max_bytes" to the radix tree was my idea, I believe, but
> it seems the wrong responsibility. Not all uses will have a
> work_mem-type limit, I'm guessing. We only use it for limiting the max
> block size, and aset's default 8MB is already plenty small for
> vacuum's large limit anyway. tidbitmap.c's limit is work_mem, so
> smaller, and there it makes sense to limit the max blocksize this way.
> C: The context for values has complex #ifdefs based on the value
> length/varlen, but it's both too much and not enough. If we get a bump
> context, how would we shoehorn that in for values for vacuum but not
> for tidbitmap?
>
> Here's an idea: Have vacuum (or tidbitmap etc.) pass a context to
> TidStoreCreate(), and then to RT_CREATE. That context will contain the
> values (for local mem), and the node slabs will be children of the
> value context. That way, measuring memory usage and free-ing can just
> call with this parent context, and let recursion handle the rest.
> Perhaps the passed context can also hold the radix-tree struct, but
> I'm not sure since I haven't tried it. What do you think?

If I understand your idea correctly, RT_CREATE() creates the context
for values as a child of the passed context, and the node slabs as
children of the value context. That way, measuring memory usage can be
done with just the value context. It sounds like a good idea, but it
was not clear to me how it addresses points B and C.

Another variant of this idea would be for RT_CREATE() to create a
parent context above the value context to store the radix-tree struct.
That is, the hierarchy would look like:

A MemoryContext (passed by vacuum through tidstore)
  - radix tree memory context (stores the radix-tree struct, control
    struct, and iterators)
    - value context (aset, slab, or bump)
      - node slab contexts

Freeing can then just be called with the radix tree memory context.
And perhaps it would work even if tidstore passes CurrentMemoryContext
to RT_CREATE()?
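
To illustrate, here's a rough sketch of that hierarchy using the
existing memory-context API. The context names and SomeNodeType are
placeholders, and the real tree would create one slab context per node
size class:

#include "postgres.h"
#include "utils/memutils.h"

/* hypothetical stand-in for one of the radix tree's node structs */
typedef struct SomeNodeType
{
	uint8		kind;
	uint8		count;
} SomeNodeType;

static MemoryContext
rt_create_contexts(MemoryContext parent_ctx)
{
	MemoryContext tree_ctx;
	MemoryContext value_ctx;

	/* holds the radix-tree struct, control struct, and iterators */
	tree_ctx = AllocSetContextCreate(parent_ctx,
									 "radix tree",
									 ALLOCSET_SMALL_SIZES);

	/* holds leaf values; could be aset, slab, or a future bump context */
	value_ctx = AllocSetContextCreate(tree_ctx,
									  "radix tree values",
									  ALLOCSET_DEFAULT_SIZES);

	/* node allocations live under the value context */
	(void) SlabContextCreate(value_ctx,
							 "radix tree nodes",
							 SLAB_DEFAULT_BLOCK_SIZE,
							 sizeof(SomeNodeType));

	return tree_ctx;
}

With this shape, MemoryContextMemAllocated(tree_ctx, true) measures
the whole tree recursively, and MemoryContextDelete(tree_ctx) frees
everything, values and node slabs included, in one call.
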

>
> With this resolved, I think the radix tree is pretty close to
> committable. The tid store will likely need some polish yet, but no
> major issues I know of.

Agreed.

>
> (And, finally, a small thing I that I wanted to share just so I don't
> forget, but maybe not worth the attention: In Andres's prototype,
> there is a comment wondering if an update can skip checking if it
> first needs to create a root node. This is pretty easy, and done in
> v61-0008.)

LGTM, thanks!

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
