Re: [PoC] Improve dead tuple storage for lazy vacuum

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2023-01-26 06:54:43
Message-ID: CAFBsxsF2-xJkUMoQCO0hda_SHKXsz2WLWbp8ep3J_bKFWiozkQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jan 25, 2023 at 8:42 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
wrote:
>
> On Mon, Jan 23, 2023 at 8:20 PM John Naylor
> <john(dot)naylor(at)enterprisedb(dot)com> wrote:
> >
> > On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
> > wrote:
> > >
> > > On Mon, Jan 16, 2023 at 2:02 PM John Naylor
> > > <john(dot)naylor(at)enterprisedb(dot)com> wrote:
> >
> > In v21, all of your v20 improvements to the radix tree template and test
> > have been squashed into 0003, with one exception: v20-0010 (recursive
> > freeing of shared mem), which I've attached separately (for flexibility)
> > as v21-0006. I believe one of your earlier patches had a new DSA function
> > for freeing memory more quickly -- was there a problem with that approach?
> > I don't recall where that discussion went.
>
> Hmm, I don't remember proposing such a patch, either.

I went looking, and it turns out I remembered wrong, sorry.

> One idea to address this would be to pass shared memory to RT_CREATE()
> and create a DSA area dedicated to the radix tree in place. We would
> return the created DSA area along with the radix tree so that the caller
> can use it (e.g., for dsa_get_handle(), dsa_pin(), dsa_pin_mapping(),
> etc.). In RT_FREE(), we would just detach from the DSA area. A downside
> of this idea is that a dedicated DSA area is always required for each
> radix tree.
>
> Another idea would be to allocate a big enough DSA area and carve out
> small chunks of memory for nodes from there. But that would introduce
> additional complexity, so I prefer to avoid it.
>
> FYI the current design is inspired by dshash.c. In dshash_destroy(),
> we dsa_free() each element allocated by dshash.c.

Okay, thanks for the info.
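
To make sure I'm reading the proposal correctly, here is roughly the shape I
imagine it would take. This is only a sketch -- the parameter list, field
names, and control-object handling are invented for illustration and don't
come from any posted patch:

/* sketch only: create a dedicated DSA area in caller-provided memory */
RT_SCOPE RT_RADIX_TREE *
RT_CREATE(void *place, size_t size, int tranche_id, dsa_area **area_p)
{
    RT_RADIX_TREE *tree;
    dsa_area   *dsa;

    /* DSA area dedicated to this radix tree, created in place */
    dsa = dsa_create_in_place(place, size, tranche_id, NULL);

    tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
    tree->dsa = dsa;
    /* ... allocate the control object within the new area ... */

    /* hand the area back so the caller can dsa_pin(), dsa_get_handle(), etc. */
    *area_p = dsa;
    return tree;
}

/* sketch only: freeing becomes a detach rather than a recursive walk */
RT_SCOPE void
RT_FREE(RT_RADIX_TREE *tree)
{
    dsa_detach(tree->dsa);
    pfree(tree);
}

If that's the shape you had in mind, then the downside you mention (one
dedicated area per tree) does seem like the main cost.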

> > 0007 makes the value type configurable. Some debug functionality still
> > assumes integer type, but I think the rest is agnostic.
>
> radixtree_search_impl.h still assumes that the value type is an
> integer type as follows:
>
> #ifdef RT_NODE_LEVEL_LEAF
> RT_VALUE_TYPE value = 0;
>
> Assert(RT_NODE_IS_LEAF(node));
> #else
>
> Also, I think that if we make the value type configurable, it's better
> to pass a pointer to the value to RT_SET() instead of copying the value,
> since the value size could be large.

Thanks, I will remove the assignment and look into pass-by-reference.
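
For the archives, what I understand that to mean is just a signature change
along these lines (illustrative only; the final form may differ):

/* current: the value is copied at each call site */
RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value);

/* proposed: pass a pointer and copy into the leaf inside the tree code,
 * so a potentially large value type is never passed by value */
RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);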

> Oops, the fix is missed in the patch for some reason. I'll fix it.
>
> > There is also this, in the template, which I'm not sure has been
> > addressed:
> >
> > * XXX: Currently we allow only one process to do iteration. Therefore,
> > * rt_node_iter has the local pointers to nodes, rather than RT_PTR_ALLOC.
> > * We need either a safeguard to disallow other processes to begin the
> > * iteration while one process is doing or to allow multiple processes to
> > * do the iteration.
>
> It's not addressed yet. I think adding a safeguard is better for the
> first version. A simple solution is to add a flag, say iter_active, so
> that only one process can do iteration at a time. What do you think?

I don't quite have enough info to offer an opinion, but this sounds like a
different form of locking. I'm sure it's come up before, but could you
describe why iteration is different from other operations, regarding
concurrency?
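
Just so we're looking at the same thing, I take the proposed safeguard to be
roughly this -- the field placement and error wording are hypothetical:

/* hypothetical sketch of the iter_active safeguard */
RT_SCOPE RT_ITER *
RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
{
    RT_ITER    *iter;

    if (tree->ctl->iter_active)
        elog(ERROR, "another iteration is already in progress on this radix tree");
    tree->ctl->iter_active = true;

    iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
    iter->tree = tree;
    return iter;
}

RT_SCOPE void
RT_END_ITERATE(RT_ITER *iter)
{
    iter->tree->ctl->iter_active = false;
    pfree(iter);
}

If so, that's effectively a lock that errors out instead of waiting, which is
why I'd like to understand how it fits with the rest of the locking story.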

> > Would it be worth it (or possible) to calculate constants based on the
> > compile-time block size? And/or have a fallback for other table AMs?
> > Since this file is in access/common, the intention is to allow
> > general-purpose use, I imagine.
>
> I think we can pass the maximum offset number to tidstore_create()
> and calculate these values from it.

That would work easily for vacuumlazy.c, since it's in the "heap" subdir so
we know the max possible offset. I haven't looked at vacuumparallel.c, but
I can tell it is not in a heap-specific directory, so I don't know how easy
that would be to pass along the right value.
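
If we do go that route, I'm picturing something along these lines, where the
constants become fields computed at create time. The parameter list and field
names are invented just to show the shape, not taken from the patch:

/* sketch only: derive per-page sizing from the caller's maximum offset */
TidStore *
tidstore_create(size_t max_bytes, OffsetNumber max_offset)
{
    TidStore   *ts = (TidStore *) palloc0(sizeof(TidStore));

    /* bits needed to encode any offset the caller can produce */
    ts->offset_nbits = pg_ceil_log2_32((uint32) max_offset + 1);

    /* ... remaining setup elided ... */
    return ts;
}

The open question remains what vacuumparallel.c would pass here.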

> > About shared memory: I have some mild reservations about the naming of
> > the "control object", which may be in shared memory. Is that an
> > established term? (If so, disregard the rest.) It seems backwards -- the
> > thing in shared memory is the actual tree itself. The thing in
> > backend-local memory has the "handle", and that's how we control the
> > tree. I don't have a better naming scheme, though, and it might not be
> > that important. (Added a WIP comment.)
>
> That seems like a valid concern. I borrowed "control object" from
> dshash.c, but that supports only the shared case. The fact that the radix
> tree supports both local and shared memory seems to introduce this
> confusion. I came up with other names such as RT_RADIX_TREE_CORE or
> RT_RADIX_TREE_ROOT, but I'm not sure these are better than the current
> one.

Okay, if dshash uses it, we have some precedent.

> > Now might be a good time to look at earlier XXX comments and come up
> > with a plan to address them.
>
> Agreed.
>
> Other XXX comments that are not mentioned yet are:
>
> + /* XXX: memory context support */
> + tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
>
> I'm not sure we really need memory context support for RT_ATTACH()
> since in the shared case, we allocate backend-local memory only for
> RT_RADIX_TREE.

Okay, we can remove this.

> ---
> +RT_SCOPE uint64
> +RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
> +{
> + // XXX is this necessary?
> + Size total = sizeof(RT_RADIX_TREE);
>
> Regarding this, I followed intset_memory_usage(). But in the radix
> tree, RT_RADIX_TREE is very small, so we can probably ignore it.

That was more a note to myself that I had forgotten about, so here is my
reasoning: in the shared case, we just overwrite that initial total, but in
the local case we add to it. A future reader could think this is
inconsistent and needs to be fixed. Since we deduct from the GUC limit to
guard against worst-case re-allocation, and that deduction is not very
precise (nor does it need to be), I agree we should just forget about tiny
sizes like this in both cases.
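
To spell out what I mean, the current shape is roughly the following, and the
initial sizeof() only ever matters in the local branch (the condition and
context member here are approximations, not the exact code):

Size        total = sizeof(RT_RADIX_TREE);

if (tree->dsa != NULL)          /* shared: the initial value is overwritten */
    total = dsa_get_total_size(tree->dsa);
else                            /* local: the initial value is added to */
    total += MemoryContextMemAllocated(tree->context, true);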

> ---
> +/* XXX For display, assumes value type is numeric */
> +static void
> +RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
>
> I think we can display values in a hex-encoded format, but given that
> the value could be large, we don't necessarily need to display actual
> values. Displaying the tree structure and chunks would be helpful for
> debugging the radix tree.

Okay, I can try that unless you do it first.

> There is no XXX comment for it, but I'll try to add lock support in the
> next version of the patch.

Since there were calls to LWLockAcquire/Release in the last version, I'm a
bit confused by this. Perhaps for the next patch, the email should contain
a few sentences describing how locking is intended to work, including for
iteration.

Hmm, I wonder if we need to use the isolation tester. It's both a blessing
and a curse that the first client of this data structure is TID lookup. It's
a blessing because that workload doesn't mix reads and writes with high
concurrency, so simple locking is adequate. It's a curse because, to test
locking with any real chance of finding bugs, we can't rely on vacuum to
tell us anything -- as you've said, it might very well work fine with no
locking at all. So we must come up with test cases ourselves.

--
John Naylor
EDB: http://www.enterprisedb.com
