Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <johncnaylorls(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2024-01-26 14:05:54
Message-ID: CAD21AoDoWrbNf+K2Fwg2n=CZDHigjkndwqy_86BGgXBp9Kbq4Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jan 24, 2024 at 3:42 PM John Naylor <johncnaylorls(at)gmail(dot)com> wrote:
>
> On Tue, Jan 23, 2024 at 10:58 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > The new patches probably need to be polished but the VacDeadItemInfo
> > idea looks good to me.
>
> That idea looks good to me, too. Since you already likely know what
> you'd like to polish, I don't have much to say except for a few
> questions below. I also did a quick sweep through every patch, so some
> of these comments are unrelated to recent changes:

Thank you!

>
> v55-0003:
>
> +size_t
> +dsa_get_total_size(dsa_area *area)
> +{
> + size_t size;
> +
> + LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
> + size = area->control->total_segment_size;
> + LWLockRelease(DSA_AREA_LOCK(area));
>
> I looked and found dsa.c doesn't already use shared locks in HEAD,
> even dsa_dump. Not sure why that is...

Oh, the dsa_dump part seems to be a bug. But it'll keep it consistent
with others.

>
> +/*
> + * Calculate the slab blocksize so that we can allocate at least 32 chunks
> + * from the block.
> + */
> +#define RT_SLAB_BLOCK_SIZE(size) \
> + Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
>
> The first parameter seems to be trying to make the block size exact,
> but that's not right, because of the chunk header, and maybe
> alignment. If the default block size is big enough to waste only a
> tiny amount of space, let's just use that as-is.

Agreed.

> Also, I think all
> block sizes in the code base have been a power of two, but I'm not
> sure how much that matters.

Did you mean all slab block sizes we use in radixtree.h?

>
> +#ifdef RT_SHMEM
> + fprintf(stderr, " [%d] chunk %x slot " DSA_POINTER_FORMAT "\n",
> + i, n4->chunks[i], n4->children[i]);
> +#else
> + fprintf(stderr, " [%d] chunk %x slot %p\n",
> + i, n4->chunks[i], n4->children[i]);
> +#endif
>
> Maybe we could invent a child pointer format, so we only #ifdef in one place.

WIll change.

>
> --- /dev/null
> +++ b/src/test/modules/test_radixtree/meson.build
> @@ -0,0 +1,35 @@
> +# FIXME: prevent install during main install, but not during test :/
>
> Can you look into this?

Okay, I'll look at it.

>
> test_radixtree.c:
>
> +/*
> + * XXX: should we expose and use RT_SIZE_CLASS and RT_SIZE_CLASS_INFO?
> + */
> +static int rt_node_class_fanouts[] = {
> + 4, /* RT_CLASS_3 */
> + 15, /* RT_CLASS_32_MIN */
> + 32, /* RT_CLASS_32_MAX */
> + 125, /* RT_CLASS_125 */
> + 256 /* RT_CLASS_256 */
> +};
>
> These numbers have been wrong a long time, too, but only matters for
> figuring out where it went wrong when something is broken. And for the
> XXX, instead of trying to use the largest number that should fit (it's
> obviously not testing that the expected node can actually hold that
> number anyway), it seems we can just use a "big enough" number to
> cause growing into the desired size class.
>
> As far as cleaning up the tests, I always wondered why these didn't
> use EXPECT_TRUE, EXPECT_FALSE, etc. as in Andres's prototype where
> where convenient, and leave comments above the tests. That seemed like
> a good idea to me -- was there a reason to have hand-written branches
> and elog messages everywhere?

The current test is based on test_integerset. I agree that we can
improve it by using EXPECT_TRUE etc.

>
> --- a/src/tools/pginclude/cpluspluscheck
> +++ b/src/tools/pginclude/cpluspluscheck
> @@ -101,6 +101,12 @@ do
> test "$f" = src/include/nodes/nodetags.h && continue
> test "$f" = src/backend/nodes/nodetags.h && continue
>
> + # radixtree_*_impl.h cannot be included standalone: they are just
> code fragments.
> + test "$f" = src/include/lib/radixtree_delete_impl.h && continue
> + test "$f" = src/include/lib/radixtree_insert_impl.h && continue
> + test "$f" = src/include/lib/radixtree_iter_impl.h && continue
> + test "$f" = src/include/lib/radixtree_search_impl.h && continue
>
> Ha! I'd forgotten about these -- they're long outdated.

Will remove.

>
> v55-0005:
>
> - * The radix tree is locked in shared mode during the iteration, so
> - * RT_END_ITERATE needs to be called when finished to release the lock.
> + * The caller needs to acquire a lock in shared mode during the iteration
> + * if necessary.
>
> "need if necessary" is maybe better phrased as "is the caller's responsibility"

Will fix.

>
> + /*
> + * We can rely on DSA_AREA_LOCK to get the total amount of DSA memory.
> + */
> total = dsa_get_total_size(tree->dsa);
>
> Maybe better to have a header comment for RT_MEMORY_USAGE that the
> caller doesn't need to take a lock.

Will fix.

>
> v55-0006:
>
> "WIP: Not built, since some benchmarks have broken" -- I'll work on
> this when I re-run some benchmarks.

Thanks!

>
> v55-0007:
>
> + * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value,
> + * and stored in the radix tree.
>
> This hasn't been true for a few months now, and I thought we fixed
> this in some earlier version?

Yeah, I'll fix it.

>
> + * TODO: The caller must be certain that no other backend will attempt to
> + * access the TidStore before calling this function. Other backend must
> + * explicitly call TidStoreDetach() to free up backend-local memory associated
> + * with the TidStore. The backend that calls TidStoreDestroy() must not call
> + * TidStoreDetach().
>
> Do we need to do anything now?

No, will remove it.

>
> v55-0008:
>
> -TidStoreAttach(dsa_area *area, TidStoreHandle handle)
> +TidStoreAttach(dsa_area *area, dsa_pointer rt_dp)
>
> "handle" seemed like a fine name. Is that not the case anymore? The
> new one is kind of cryptic. The commit message just says "remove
> control object" -- does that imply that we need to think of this
> parameter differently, or is it unrelated? (Same with
> dead_items_handle in 0011)

Since it's actually just a radix tree's handle it was kind of
unnatural to me to use the same dsa_pointer as different handles. But
rethinking it, I agree "handle" is a fine name.

>
> v55-0011:
>
> + /*
> + * Recreate the tidstore with the same max_bytes limitation. We cannot
> + * use neither maintenance_work_mem nor autovacuum_work_mem as they could
> + * already be changed.
> + */
>
> I don't understand this part.

I wanted to mean that if maintenance_work_mem is changed and the
config file is reloaded, its value could no longer be the same as the
one that we used when initializing the parallel vacuum. That's why we
need to store max_bytes in the DSM. I'll rephrase it.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dave Cramer 2024-01-26 14:18:50 Re: [PATCH] Add native windows on arm64 support
Previous Message Alvaro Herrera 2024-01-26 14:01:52 Re: Add SPLIT PARTITION/MERGE PARTITIONS commands