Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <johncnaylorls(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2024-02-06 02:57:58
Message-ID: CAD21AoCKBiTr6Voy20qJ09j5DZa0cOuc=VOzwm-sCLK7PDG8xg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Feb 2, 2024 at 8:47 PM John Naylor <johncnaylorls(at)gmail(dot)com> wrote:
>
> On Wed, Jan 31, 2024 at 12:50 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > I've attached the new patch set (v56). I've squashed previous updates
> > and addressed review comments on v55 in separate patches. Here are the
> > update summary:
> >
> > 0004: fix compiler warning caught by ci test.
> > 0005-0008: address review comments on radix tree codes.
> > 0009: cleanup #define and #undef
> > 0010: use TEST_SHARED_RT macro for shared radix tree test. RT_SHMEM is
> > undefined after including radixtree.h so we should not use it in test
> > code.
>
> Great, thanks!
>
> I have a few questions and comments on v56, then I'll address yours
> below with the attached v57, which is mostly cosmetic adjustments.

Thank you for the comments! I've squashed previous updates and your changes.

>
> v56-0003:
>
> (Looking closer at tests)
>
> +static const bool rt_test_stats = false;
>
> I'm thinking we should just remove everything that depends on this,
> and keep this module entirely about correctness.

Agreed. Removed in 0006 patch.

>
> + for (int shift = 0; shift <= (64 - 8); shift += 8)
> + test_node_types(shift);
>
> I'm not sure what the test_node_types_* functions are testing that
> test_basic doesn't. They have a different, and confusing, way to stop
> at every size class and check the keys/values. It seems we can replace
> all that with two more calls (asc/desc) to test_basic, with the
> maximum level.

Agreed, addressed in 0007 patch.

>
> It's pretty hard to see what test_pattern() is doing, or why it's
> useful. I wonder if instead the test could use something like the
> benchmark where random integers are masked off. That seems simpler. I
> can work on that, but I'd like to hear your side about test_pattern().

Yeah, test_pattern() is originally created for the integerset so it
doesn't necessarily fit the radixtree. I agree to use some tests from
benchmarks.

>
> v56-0007:
>
> + *
> + * Since we can rely on DSA_AREA_LOCK to get the total amount of DSA memory,
> + * the caller doesn't need to take a lock.
>
> Maybe something like "Since dsa_get_total_size() does appropriate locking ..."?

Agreed. Fixed in 0005 patch.

>
> v56-0008
>
> Thanks, I like how the tests look now.
>
> -NOTICE: testing node 4 with height 0 and ascending keys
> ...
> +NOTICE: testing node 1 with height 0 and ascending keys
>
> Now that the number is not intended to match a size class, "node X"
> seems out of place. Maybe we could have a separate array with strings?
>
> + 1, /* RT_CLASS_4 */
>
> This should be more than one, so that the basic test still exercises
> paths that shift elements around.
>
> + 100, /* RT_CLASS_48 */
>
> This node currently holds 64 for local memory.
>
> + 255 /* RT_CLASS_256 */
>
> This is the only one where we know exactly how many it can take, so
> may as well keep it at 256.

Fixed in 0008 patch.

>
> v56-0012:
>
> The test module for tidstore could use a few more comments.

Addressed in 0012 patch.

>
> v56-0015:
>
> +typedef dsa_pointer TidStoreHandle;
> +
>
> -TidStoreAttach(dsa_area *area, dsa_pointer rt_dp)
> +TidStoreAttach(dsa_area *area, TidStoreHandle handle)
> {
> TidStore *ts;
> + dsa_pointer rt_dp = handle;
>
> My earlier opinion was that "handle" was a nicer variable name, but
> this brings back the typedef and also keeps the variable name I didn't
> like, but pushes it down into the function. I'm a bit confused, so
> I've kept these not-squashed for now.

I misunderstood your comment. I've changed to use a variable name
rt_handle and removed the TidStoreHandle type. 0013 patch.

>
> -----------------------------------------------------------------------------------
>
> Now, for v57:
>
> > Looking at overall changes, there are still XXX and TODO comments in
> > radixtree.h:
>
> That's fine, as long as it's intentional as a message to readers. That
> said, we can get rid of some:
>
> > ---
> > * XXX There are 4 node kinds, and this should never be increased,
> > * for several reasons:
> > * 1. With 5 or more kinds, gcc tends to use a jump table for switch
> > * statements.
> > * 2. The 4 kinds can be represented with 2 bits, so we have the option
> > * in the future to tag the node pointer with the kind, even on
> > * platforms with 32-bit pointers. This might speed up node traversal
> > * in trees with highly random node kinds.
> > * 3. We can have multiple size classes per node kind.
> >
> > Can we just remove "XXX"?
>
> How about "NOTE"?

Agreed.

>
> > ---
> > * WIP: notes about traditional radix tree trading off span vs height...
> >
> > Are you going to write it?
>
> Yes, when I draft a rough commit message, (for next time).

Thanks!

>
> > ---
> > #ifdef RT_SHMEM
> > /* WIP: do we really need this? */
> > typedef dsa_pointer RT_HANDLE;
> > #endif
> >
> > I think it's worth having it.
>
> Okay, removed WIP in v57-0004.
>
> > ---
> > * WIP: The paper uses at most 64 for this node kind. "isset" happens to fit
> > * inside a single bitmapword on most platforms, so it's a good starting
> > * point. We can make it higher if we need to.
> > */
> > #define RT_FANOUT_48_MAX (RT_NODE_MAX_SLOTS / 4)
> >
> > Are you going to work something on this?
>
> Hard-coded 64 for readability, and changed this paragraph to explain
> the current rationale more clearly:
>
> "The paper uses at most 64 for this node kind, and one advantage for us
> is that "isset" is a single bitmapword on most platforms, rather than
> an array, allowing the compiler to get rid of loops."

LGTM.

>
> > ---
> > /* WIP: We could go first to the higher node16 size class */
> > newnode = RT_ALLOC_NODE(tree, RT_NODE_KIND_16, RT_CLASS_16_LO);
> >
> > Does it mean to go to RT_CLASS_16_HI and then further go to
> > RT_CLASS_16_LO upon further deletion?
>
> Yes. It wouldn't be much work to make shrinking symmetrical with
> growing (a good thing), but it's not essential so I haven't done it
> yet.

Okay, let's keep it as WIP.

>
> > ---
> > * TODO: The current locking mechanism is not optimized for high concurrency
> > * with mixed read-write workloads. In the future it might be worthwhile
> > * to replace it with the Optimistic Lock Coupling or ROWEX mentioned in
> > * the paper "The ART of Practical Synchronization" by the same authors as
> > * the ART paper, 2016.
> >
> > I think it's not TODO for now, but a future improvement. We can remove it.
>
> It _is_ a TODO, regardless of when it happens.

Understood.

>
> > ---
> > /* TODO: consider 5 with subclass 1 or 2. */
> > #define RT_FANOUT_4 4
> >
> > Is there something we need to do here?
>
> Changed to:
>
> "To save memory in trees with sparse keys, it would make sense to have two
> size classes for the smallest kind (perhaps a high class of 5 and a low class
> of 2), but it would be more effective to utilize lazy expansion and
> path compression."

LGTM. But there is an extra '*' in the last line:

+ /*
:
+ * of 2), but it would be more effective to utilize lazy expansion and
+ * path compression.
+ * */

Fixed in 0004 patch.

>
> > ---
> > /*
> > * Return index of the chunk and slot arrays for inserting into the node,
> > * such that the chunk array remains ordered.
> > * TODO: Improve performance for non-SIMD platforms.
> > */
> >
> > Are you going to work on this?
>
> A small step in v57-0010. I've found a way to kill two birds with one
> stone, by first checking for the case that the keys are inserted in
> order. This also helps the SIMD case because it must branch anyway to
> avoid bitscanning a zero bitfield. This moves the branch up and turns
> a mask into an assert, looking a bit nicer. I've removed the TODO, but
> maybe we should add it to the search_eq function.

Great!

>
> > ---
> > /* Delete the element at 'idx' */
> > /* TODO: replace slow memmove's */
> >
> > Are you going to work on this?
>
> Done in v57-0011.

LGTM.

>
> The rest:
> v57-0004 - 0008 should be self explanatory, but questions/pushback welcome.
> v57-0009 - I'm thinking leaves don't need to be memset at all. The
> value written should be entirely the caller's responsibility, it
> seems.
> v57-0013 - the bench module can be built locally again
> v57-0016 - minor comment edits in tid store

These fixes look good to me.

>
> My todo:
> - benchmark tid store / vacuum again, since we haven't since varlen
> types and removing unnecessary locks. I'm pretty sure there's an
> accidental memset call that crept in there, but I'm running out of
> steam today.
> - leftover comment etc work

Thanks. I'm also going to do some benchmarks and tests.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v58-ART.tar.gz application/x-gzip 58.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Lakhin 2024-02-06 03:00:00 Re: Why is subscription/t/031_column_list.pl failing so much?
Previous Message Michael Paquier 2024-02-06 02:41:06 Re: Make COPY format extendable: Extract COPY TO format implementations