Re: [PoC] Improve dead tuple storage for lazy vacuum

From: John Naylor <johncnaylorls(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2024-02-02 11:47:02
Message-ID: CANWCAZZMeKnPV1ZybZZN7S5=AF=uE7zUsix=H2VP0=UTKrOMhg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jan 31, 2024 at 12:50 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> I've attached the new patch set (v56). I've squashed previous updates
> and addressed review comments on v55 in separate patches. Here are the
> update summary:
>
> 0004: fix compiler warning caught by ci test.
> 0005-0008: address review comments on radix tree codes.
> 0009: cleanup #define and #undef
> 0010: use TEST_SHARED_RT macro for shared radix tree test. RT_SHMEM is
> undefined after including radixtree.h so we should not use it in test
> code.

Great, thanks!

I have a few questions and comments on v56, then I'll address yours
below with the attached v57, which is mostly cosmetic adjustments.

v56-0003:

(Looking closer at tests)

+static const bool rt_test_stats = false;

I'm thinking we should just remove everything that depends on this,
and keep this module entirely about correctness.

+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);

I'm not sure what the test_node_types_* functions are testing that
test_basic doesn't. They have a different, and confusing, way to stop
at every size class and check the keys/values. It seems we can replace
all that with two more calls (asc/desc) to test_basic, with the
maximum level.

It's pretty hard to see what test_pattern() is doing, or why it's
useful. I wonder if instead the test could use something like the
benchmark where random integers are masked off. That seems simpler. I
can work on that, but I'd like to hear your side about test_pattern().

v56-0007:

+ *
+ * Since we can rely on DSA_AREA_LOCK to get the total amount of DSA memory,
+ * the caller doesn't need to take a lock.

Maybe something like "Since dsa_get_total_size() does appropriate locking ..."?

v56-0008

Thanks, I like how the tests look now.

-NOTICE: testing node 4 with height 0 and ascending keys
...
+NOTICE: testing node 1 with height 0 and ascending keys

Now that the number is not intended to match a size class, "node X"
seems out of place. Maybe we could have a separate array with strings?

+ 1, /* RT_CLASS_4 */

This should be more than one, so that the basic test still exercises
paths that shift elements around.

+ 100, /* RT_CLASS_48 */

This node currently holds 64 for local memory.

+ 255 /* RT_CLASS_256 */

This is the only one where we know exactly how many it can take, so
may as well keep it at 256.

v56-0012:

The test module for tidstore could use a few more comments.

v56-0015:

+typedef dsa_pointer TidStoreHandle;
+

-TidStoreAttach(dsa_area *area, dsa_pointer rt_dp)
+TidStoreAttach(dsa_area *area, TidStoreHandle handle)
{
TidStore *ts;
+ dsa_pointer rt_dp = handle;

My earlier opinion was that "handle" was a nicer variable name, but
this brings back the typedef and also keeps the variable name I didn't
like, but pushes it down into the function. I'm a bit confused, so
I've kept these not-squashed for now.

-----------------------------------------------------------------------------------

Now, for v57:

> Looking at overall changes, there are still XXX and TODO comments in
> radixtree.h:

That's fine, as long as it's intentional as a message to readers. That
said, we can get rid of some:

> ---
> * XXX There are 4 node kinds, and this should never be increased,
> * for several reasons:
> * 1. With 5 or more kinds, gcc tends to use a jump table for switch
> * statements.
> * 2. The 4 kinds can be represented with 2 bits, so we have the option
> * in the future to tag the node pointer with the kind, even on
> * platforms with 32-bit pointers. This might speed up node traversal
> * in trees with highly random node kinds.
> * 3. We can have multiple size classes per node kind.
>
> Can we just remove "XXX"?

How about "NOTE"?

> ---
> * WIP: notes about traditional radix tree trading off span vs height...
>
> Are you going to write it?

Yes, when I draft a rough commit message, (for next time).

> ---
> #ifdef RT_SHMEM
> /* WIP: do we really need this? */
> typedef dsa_pointer RT_HANDLE;
> #endif
>
> I think it's worth having it.

Okay, removed WIP in v57-0004.

> ---
> * WIP: The paper uses at most 64 for this node kind. "isset" happens to fit
> * inside a single bitmapword on most platforms, so it's a good starting
> * point. We can make it higher if we need to.
> */
> #define RT_FANOUT_48_MAX (RT_NODE_MAX_SLOTS / 4)
>
> Are you going to work something on this?

Hard-coded 64 for readability, and changed this paragraph to explain
the current rationale more clearly:

"The paper uses at most 64 for this node kind, and one advantage for us
is that "isset" is a single bitmapword on most platforms, rather than
an array, allowing the compiler to get rid of loops."

> ---
> /* WIP: We could go first to the higher node16 size class */
> newnode = RT_ALLOC_NODE(tree, RT_NODE_KIND_16, RT_CLASS_16_LO);
>
> Does it mean to go to RT_CLASS_16_HI and then further go to
> RT_CLASS_16_LO upon further deletion?

Yes. It wouldn't be much work to make shrinking symmetrical with
growing (a good thing), but it's not essential so I haven't done it
yet.

> ---
> * TODO: The current locking mechanism is not optimized for high concurrency
> * with mixed read-write workloads. In the future it might be worthwhile
> * to replace it with the Optimistic Lock Coupling or ROWEX mentioned in
> * the paper "The ART of Practical Synchronization" by the same authors as
> * the ART paper, 2016.
>
> I think it's not TODO for now, but a future improvement. We can remove it.

It _is_ a TODO, regardless of when it happens.

> ---
> /* TODO: consider 5 with subclass 1 or 2. */
> #define RT_FANOUT_4 4
>
> Is there something we need to do here?

Changed to:

"To save memory in trees with sparse keys, it would make sense to have two
size classes for the smallest kind (perhaps a high class of 5 and a low class
of 2), but it would be more effective to utilize lazy expansion and
path compression."

> ---
> /*
> * Return index of the chunk and slot arrays for inserting into the node,
> * such that the chunk array remains ordered.
> * TODO: Improve performance for non-SIMD platforms.
> */
>
> Are you going to work on this?

A small step in v57-0010. I've found a way to kill two birds with one
stone, by first checking for the case that the keys are inserted in
order. This also helps the SIMD case because it must branch anyway to
avoid bitscanning a zero bitfield. This moves the branch up and turns
a mask into an assert, looking a bit nicer. I've removed the TODO, but
maybe we should add it to the search_eq function.

> ---
> /* Delete the element at 'idx' */
> /* TODO: replace slow memmove's */
>
> Are you going to work on this?

Done in v57-0011.

The rest:
v57-0004 - 0008 should be self explanatory, but questions/pushback welcome.
v57-0009 - I'm thinking leaves don't need to be memset at all. The
value written should be entirely the caller's responsibility, it
seems.
v57-0013 - the bench module can be built locally again
v57-0016 - minor comment edits in tid store

My todo:
- benchmark tid store / vacuum again, since we haven't since varlen
types and removing unnecessary locks. I'm pretty sure there's an
accidental memset call that crept in there, but I'm running out of
steam today.
- leftover comment etc work

Attachment Content-Type Size
v57-ART.tar.gz application/gzip 66.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Yugo NAGATA 2024-02-02 11:56:08 Re: InstallXLogFileSegment() vs concurrent WAL flush
Previous Message David Rowley 2024-02-02 11:35:52 Re: An improvement on parallel DISTINCT