Re: [PoC] Improve dead tuple storage for lazy vacuum

From: John Naylor <johncnaylorls(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2024-02-15 11:26:13
Message-ID: CANWCAZbumguvp93hZhjGW16=Kf20tHJ0pyEXEFdZuCS0m7bcEQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Feb 15, 2024 at 10:21 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Sat, Feb 10, 2024 at 9:29 PM John Naylor <johncnaylorls(at)gmail(dot)com> wrote:

> I've also run the same scripts in my environment just in case and got
> similar results:

Thanks for testing, looks good as well.

> > There are still some micro-benchmarks we could do on tidstore, and
> > it'd be good to find out worse-case memory use (1 dead tuple each on
> > spread-out pages), but this is decent demonstration.
>
> I've tested a simple case where vacuum removes 33k dead tuples spread
> about every 10 pages.
>
> master:
> 198,000 bytes (=33000 * 6)
> system usage: CPU: user: 29.49 s, system: 0.88 s, elapsed: 30.40 s
>
> v-59:
> 2,834,432 bytes (reported by TidStoreMemoryUsage())
> system usage: CPU: user: 15.96 s, system: 0.89 s, elapsed: 16.88 s

The memory usage for the sparse case may be a concern, although it's
not bad -- a multiple of something small is probably not huge in
practice. See below for an option we have for this.

> > > > I'm pretty sure there's an
> > > > accidental memset call that crept in there, but I'm running out of
> > > > steam today.
> >
> > I have just a little bit of work to add for v59:
> >
> > v59-0009 - set_offset_bitmap_at() will call memset if it needs to zero
> > any bitmapwords. That can only happen if e.g. there is an offset > 128
> > and there are none between 64 and 128, so not a huge deal but I think
> > it's a bit nicer in this patch.
>
> LGTM.

Okay, I've squashed this.

> I've drafted the commit message.

Thanks, this is a good start.

> I've run regression tests with valgrind and run the coverity scan, and
> I don't see critical issues.

Great!

Now, I think we're in pretty good shape. There are a couple of things
that might be objectionable, so I want to try to improve them in the
little time we have:

1. Memory use for the sparse case. I shared an idea a few months ago
of how runtime-embeddable values (true combined pointer-value slots)
could work for tids. I don't think this is a must-have, but it's not a
lot of code, and I have this working:

v61-0006: Preparatory refactoring -- I think we should do this anyway,
since the intent seems more clear to me.
v61-0007: Runtime-embeddable tids -- Optional for v17, but should
reduce memory regressions, so should be considered. Up to 3 tids can
be stored in the last level child pointer. It's not polished, but I'll
only proceed with that if we think we need this. "flags" iis called
that because it could hold tidbitmap.c booleans (recheck, lossy) in
the future, in addition to reserving space for the pointer tag. Note:
I hacked the tests to only have 2 offsets per block to demo, but of
course both paths should be tested.

2. Management of memory contexts. It's pretty verbose and messy. I
think the abstraction could be better:
A: tidstore currently passes CurrentMemoryContext to RT_CREATE, so we
can't destroy or reset it. That means we have to do a lot of manual
work.
B: Passing "max_bytes" to the radix tree was my idea, I believe, but
it seems the wrong responsibility. Not all uses will have a
work_mem-type limit, I'm guessing. We only use it for limiting the max
block size, and aset's default 8MB is already plenty small for
vacuum's large limit anyway. tidbitmap.c's limit is work_mem, so
smaller, and there it makes sense to limit the max blocksize this way.
C: The context for values has complex #ifdefs based on the value
length/varlen, but it's both too much and not enough. If we get a bump
context, how would we shoehorn that in for values for vacuum but not
for tidbitmap?

Here's an idea: Have vacuum (or tidbitmap etc.) pass a context to
TidStoreCreate(), and then to RT_CREATE. That context will contain the
values (for local mem), and the node slabs will be children of the
value context. That way, measuring memory usage and free-ing can just
call with this parent context, and let recursion handle the rest.
Perhaps the passed context can also hold the radix-tree struct, but
I'm not sure since I haven't tried it. What do you think?

With this resolved, I think the radix tree is pretty close to
committable. The tid store will likely need some polish yet, but no
major issues I know of.

(And, finally, a small thing I that I wanted to share just so I don't
forget, but maybe not worth the attention: In Andres's prototype,
there is a comment wondering if an update can skip checking if it
first need to create a root node. This is pretty easy, and done in
v61-0008.)

Attachment Content-Type Size
v61-ART.tar.gz application/gzip 59.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2024-02-15 11:30:18 Re: Synchronizing slots from primary to standby
Previous Message Bertrand Drouvot 2024-02-15 11:24:59 Re: Fix race condition in InvalidatePossiblyObsoleteSlot()