Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2023-01-25 01:42:05
Message-ID: CAD21AoA-=5rpABAGCu2tJ=YzfTtNgJ5WxkZR+4SS0W54ccN5Ww@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jan 23, 2023 at 8:20 PM John Naylor
<john(dot)naylor(at)enterprisedb(dot)com> wrote:
>
> On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Mon, Jan 16, 2023 at 2:02 PM John Naylor
> > <john(dot)naylor(at)enterprisedb(dot)com> wrote:
>
> In v21, all of your v20 improvements to the radix tree template and test have been squashed into 0003, with one exception: v20-0010 (recursive freeing of shared mem), which I've attached separately (for flexibility) as v21-0006. I believe one of your earlier patches had a new DSA function for freeing memory more quickly -- was there a problem with that approach? I don't recall where that discussion went.

Hmm, I don't remember I proposed such a patch, either.

One idea to address it would be that we pass a shared memory to
RT_CREATE() and we create a DSA area dedicated to the radix tree in
place. We should return the created DSA area along with the radix tree
so that the caller can use it (e.g., for dsa_get_handle(), dsa_pin(),
and dsa_pin_mapping() etc). In RT_FREE(), we just detach from the DSA
area. A downside of this idea would be that one DSA area only for a
radix tree is always required.

Another idea would be that we allocate a big enough DSA area and
quarry small memory for nodes from there. But it would need to
introduce another complexity so I prefer to avoid it.

FYI the current design is inspired by dshash.c. In dshash_destory(),
we dsa_free() each elements allocated by dshash.c

>
> > + * XXX: Most functions in this file have two variants for inner nodes and leaf
> > + * nodes, therefore there are duplication codes. While this sometimes makes the
> > + * code maintenance tricky, this reduces branch prediction misses when judging
> > + * whether the node is a inner node of a leaf node.
> >
> > This comment seems to be out-of-date since we made it a template.
>
> Done in 0020, along with a bunch of other comment editing.
>
> > The following macros are defined but not undefined in radixtree.h:
>
> Fixed in v21-0018.
>
> Also:
>
> 0007 makes the value type configurable. Some debug functionality still assumes integer type, but I think the rest is agnostic.

radixtree_search_impl.h still assumes that the value type is an
integer type as follows:

#ifdef RT_NODE_LEVEL_LEAF
RT_VALUE_TYPE value = 0;

Assert(RT_NODE_IS_LEAF(node));
#else

Also, I think if we make the value type configurable, it's better to
pass the pointer of the value to RT_SET() instead of copying the
values since the value size could be large.

> 0010 turns node4 into node3, as discussed, going from 48 bytes to 32.
> 0012 adopts the benchmark module to the template, and adds meson support (builds with warnings, but okay because not meant for commit).
>
> The rest are cleanups, small refactorings, and more comment rewrites. I've kept them separate for visibility. Next patch can squash them unless there is any discussion.

0008 patch

for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
- fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize
%zu\tleaf_size %zu\tleaf_blocksize %zu\n",
+ fprintf(stderr, "%s\tinner_size %zu\tleaf_size %zu\t%zu\n",
RT_SIZE_CLASS_INFO[i].name,
RT_SIZE_CLASS_INFO[i].inner_size,
- RT_SIZE_CLASS_INFO[i].inner_blocksize,
- RT_SIZE_CLASS_INFO[i].leaf_size,
- RT_SIZE_CLASS_INFO[i].leaf_blocksize);
+ RT_SIZE_CLASS_INFO[i].leaf_size);

There is additional '%zu' at the end of the format string:

---
0011 patch

+ * 1. With 5 or more kinds, gcc tends to use a jump table for switch
+ * statments.

typo: s/statments/statements/

The rest look good to me. I'll incorporate these fixes in the next
version patch.

>
> > > uint32 is how we store the block number, so this too small and will wrap around on overflow. int64 seems better.
> >
> > Agreed, will fix.
>
> Great, but it's now uint64, not int64. All the large counters in struct LVRelState, for example, are signed integers, as the usual practice. Unsigned ints are "usually" for things like bit patterns and where explicit wraparound is desired. There's probably more that can be done here to change to signed types, but I think it's still a bit early to get to that level of nitpicking. (Soon, I hope :-) )

Agreed. I'll change it in the next version patch.

>
> > > + * We calculate the maximum bytes for the TidStore in different ways
> > > + * for non-shared case and shared case. Please refer to the comment
> > > + * TIDSTORE_MEMORY_DEDUCT for details.
> > > + */
> > >
> > > Maybe the #define and comment should be close to here.
> >
> > Will fix.
>
> For this, I intended that "here" meant "in or just above the function".
>
> +#define TIDSTORE_LOCAL_MAX_MEMORY_DEDUCT (1024L * 70) /* 70kB */
> +#define TIDSTORE_SHARED_MAX_MEMORY_RATIO_PO2 (float) 0.75
> +#define TIDSTORE_SHARED_MAX_MEMORY_RATIO (float) 0.6
>
> These symbols are used only once, in tidstore_create(), and are difficult to read. That function has few comments. The symbols have several paragraphs, but they are far away. It might be better for readability to just hard-code numbers in the function, with the explanation about the numbers near where they are used.

Agreed, will fix.

>
> > > + * Destroy a TidStore, returning all memory. The caller must be certain that
> > > + * no other backend will attempt to access the TidStore before calling this
> > > + * function. Other backend must explicitly call tidstore_detach to free up
> > > + * backend-local memory associated with the TidStore. The backend that calls
> > > + * tidstore_destroy must not call tidstore_detach.
> > > + */
> > > +void
> > > +tidstore_destroy(TidStore *ts)
> > >
> > > If not addressed by next patch, need to phrase comment with FIXME or TODO about making certain.
> >
> > Will fix.
>
> Did anything change here?

Oops, the fix is missed in the patch for some reason. I'll fix it.

> There is also this, in the template, which I'm not sure has been addressed:
>
> * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
> * has the local pointers to nodes, rather than RT_PTR_ALLOC.
> * We need either a safeguard to disallow other processes to begin the iteration
> * while one process is doing or to allow multiple processes to do the iteration.

It's not addressed yet. I think adding a safeguard is better for the
first version. A simple solution is to add a flag, say iter_active, to
allow only one process to enable the iteration. What do you think?

>
> > > This part only runs "if (vacrel->nindexes == 0)", so seems like unneeded complexity. It arises because lazy_scan_prune() populates the tid store even if no index vacuuming happens. Perhaps the caller of lazy_scan_prune() could pass the deadoffsets array, and upon returning, either populate the store or call lazy_vacuum_heap_page(), as needed. It's quite possible I'm missing some detail, so some description of the design choices made would be helpful.
> >
> > I agree that we don't need complexity here. I'll try this idea.
>
> Keeping the offsets array in the prunestate seems to work out well.
>
> Some other quick comments on tid store and vacuum, not comprehensive. Let me know if I've misunderstood something:
>
> TID store:
>
> + * XXXXXXXX XXXYYYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYuuuu
> + *
> + * X = bits used for offset number
> + * Y = bits used for block number
> + * u = unused bit
>
> I was confused for a while, and I realized the bits are in reverse order from how they are usually pictured (high on left, low on the right).

I borrowed it from ginpostinglist.c but it seems better to write in
the common order.

>
> + * 11 bits enough for the offset number, because MaxHeapTuplesPerPage < 2^11
> + * on all supported block sizes (TIDSTORE_OFFSET_NBITS). We are frugal with
>
> + * XXX: if we want to support non-heap table AM that want to use the full
> + * range of possible offset numbers, we'll need to reconsider
> + * TIDSTORE_OFFSET_NBITS value.
>
> Would it be worth it (or possible) to calculate constants based on compile-time block size? And/or have a fallback for other table AMs? Since this file is in access/common, the intention is to allow general-purpose, I imagine.

I think we can pass the maximum offset numbers to tidstore_create()
and calculate these values.

>
> +typedef dsa_pointer tidstore_handle;
>
> It's not clear why we need a typedef here, since here:
>
> +tidstore_attach(dsa_area *area, tidstore_handle handle)
> +{
> + TidStore *ts;
> + dsa_pointer control;
> ...
> + control = handle;
>
> ...there is a differently-named dsa_pointer variable that just gets the function parameter.

I guess one reason is to improve compatibility; we can stash the
actual value of the handle, which could help some cases, for example,
when we need to change the actual value of the handle. dshash.c uses
the same idea. Another reason would be to improve readability.

>
> +/* Return the maximum memory TidStore can use */
> +uint64
> +tidstore_max_memory(TidStore *ts)
>
> size_t is more suitable for memory.

WIll fix.

>
> + /*
> + * Since the shared radix tree supports concurrent insert,
> + * we don't need to acquire the lock.
> + */
>
> Hmm? IIUC, the caller only acquires the lock after returning from here, to update statistics. Why is it safe to insert with no lock? Am I missing something?

You're right. I was missing something. The lock should be taken before
adding key-value pairs.

>
> VACUUM integration:
>
> -#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
> +#define PARALLEL_VACUUM_KEY_DSA 2
>
> Seems like unnecessary churn? It is still all about dead items, after all. I understand using "DSA" for the LWLock, since that matches surrounding code.

Agreed, will remove.

>
> +#define HAS_LPDEAD_ITEMS(state) (((state).lpdead_items) > 0)
>
> This macro helps the patch readability in some places, but I'm not sure it helps readability of the file as a whole. The following is in the patch and seems perfectly clear without the macro:
>
> - if (lpdead_items > 0)
> + if (prunestate->lpdead_items > 0)

Will remove the macro.

>
> About shared memory: I have some mild reservations about the naming of the "control object", which may be in shared memory. Is that an established term? (If so, disregard the rest): It seems backwards -- the thing in shared memory is the actual tree itself. The thing in backend-local memory has the "handle", and that's how we control the tree. I don't have a better naming scheme, though, and might not be that important. (Added a WIP comment)

That seems a valid concern. I borrowed the "control object" from
dshash.c but it supports only shared cases. The fact that the radix
tree supports both local and shared seems to introduce this confusion.
I came up with other names such as RT_RADIX_TREE_CORE or
RT_RADIX_TREE_ROOT but not sure these are better than the current
one.

>
> Now might be a good time to look at earlier XXX comments and come up with a plan to address them.

Agreed.

Other XXX comments that are not mentioned yet are:

+ /* XXX: memory context support */
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));

I'm not sure we really need memory context support for RT_ATTACH()
since in the shared case, we allocate backend-local memory only for
RT_RADIX_TREE.

---
+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+ // XXX is this necessary?
+ Size total = sizeof(RT_RADIX_TREE);

Regarding this, I followed intset_memory_usage(). But in the radix
tree, RT_RADIX_TREE is very small so probably we can ignore it.

---
+/* XXX For display, assumes value type is numeric */
+static void
+RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)

I think we can display values in hex encoded format but given the
value could be large, we don't necessarily need to display actual
values. Displaying the tree structure and chunks would be helpful for
debugging the radix tree.

---
There is no XXX comment but I'll try to add lock support in the next
version patch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message John Naylor 2023-01-25 01:42:44 bitscan forward/reverse on Windows
Previous Message Justin Pryzby 2023-01-25 01:42:03 Re: Add LZ4 compression in pg_dump