Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2022-11-04 15:24:23
Message-ID: CAD21AoBp2ErXCc01t0+jwBrNjRA=FkKPz1-yNNFcNC-PRsG5TA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Nov 3, 2022 at 1:59 PM John Naylor <john(dot)naylor(at)enterprisedb(dot)com> wrote:
>
> On Mon, Oct 31, 2022 at 12:47 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > I've attached v8 patches. 0001, 0002, and 0003 patches incorporated
> > the comments I got so far. 0004 patch is a DSA support patch for PoC.
>
> Thanks for the new patchset. This is not a full review, but I have some comments:
>
> 0001 and 0002 look okay on a quick scan -- I will use this as a base for further work that we discussed. However, before I do so I'd like to request another revision regarding the following:
>
> > In the 0004 patch, the basic idea is to use rt_node_ptr in all
> > inner nodes to point to their children, and we use rt_node_ptr as
> > either rt_node * or dsa_pointer depending on whether the radix tree
> > is shared or not (i.e., by checking radix_tree->dsa == NULL).
>

Thank you for the comments!

> 0004: Looks like a good start, but this patch has a large number of changes like these, making it hard to read:
>
> - if (found && child_p)
> - *child_p = child;
> + if (found && childp_p)
> + *childp_p = childp;
> ...
> rt_node_inner_32 *new32;
> + rt_node_ptr new32p;
>
> /* grow node from 4 to 32 */
> - new32 = (rt_node_inner_32 *) rt_copy_node(tree, (rt_node *) n4,
> - RT_NODE_KIND_32);
> + new32p = rt_copy_node(tree, (rt_node *) n4, RT_NODE_KIND_32);
> + new32 = (rt_node_inner_32 *) node_ptr_get_local(tree, new32p);
>
> It's difficult to keep in my head what all the variables refer to. I thought a bit about how to split this patch up to make this easier to read. Here's what I came up with:
>
> typedef struct rt_node_ptr
> {
>     uintptr_t   encoded;
>     rt_node    *decoded;
> } rt_node_ptr;
>
> Note that there is nothing about "dsa or local". That's deliberate. That way, we can use the "encoded" field for a tagged pointer as well, as I hope we can do (at least for local pointers) in the future. So an intermediate patch would have "static inline void" functions node_ptr_encode() and node_ptr_decode(), which would only copy from one member to another. I suspect that: 1. The actual DSA changes will be *much* smaller and easier to reason about. 2. Experimenting with tagged pointers will be easier.

Good idea. I'll try that in the next version of the patch.
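
To make sure we are on the same page, here is a minimal sketch of the
intermediate step I'm thinking of, where the two functions only copy
between the members as you described, before any DSA changes:

typedef struct rt_node_ptr
{
    uintptr_t   encoded;
    rt_node    *decoded;
} rt_node_ptr;

static inline void
node_ptr_encode(rt_node_ptr *ptr)
{
    /* for local pointers, the encoded form is just the raw address */
    ptr->encoded = (uintptr_t) ptr->decoded;
}

static inline void
node_ptr_decode(rt_node_ptr *ptr)
{
    ptr->decoded = (rt_node *) ptr->encoded;
}

The DSA patch would then only need to change these two functions to
convert to/from dsa_pointer, and a later patch could use the spare
bits of "encoded" for tagging.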

>
> Also, quick question: 0004 has a new function rt_node_update_inner() -- is that necessary because of DSA, or does it ideally belong in 0002? What's the reason for it?

Oh, this was needed at one point when I was initially writing the DSA
support, but thinking about it again now, I think we can remove it and
use rt_node_insert_inner() with parent = NULL instead.

>
> > Regarding the performance, I've
> > added another boolean argument to bench_seq/shuffle_search(),
> > specifying whether to use the shared radix tree or not. Here are
> > benchmark results in my environment,
>
> > [...]
>
> > In non-shared radix tree cases (the fourth argument is false), I
> > don't see a visible performance degradation. On the other hand, in
> > shared radix tree cases (the fourth argument is true), I see
> > visible overheads because of dsa_get_address().
>
> Thanks, this is useful.
>
> > Please note that the current shared radix tree implementation doesn't
> > support any locking, so it cannot be read while written by someone.
>
> I think at the very least we need a global lock to enforce this.
>
> > Also, only one process can iterate over the shared radix tree. When
> > it comes to parallel vacuum, these don't become restrictions, as the
> > leader process writes the radix tree while scanning the heap and the
> > radix tree is read by multiple processes while vacuuming indexes.
> > And only the leader process can do heap vacuum by iterating over the
> > key-value pairs in the radix tree. If we want to use it for other
> > cases too, we would need to support locking, RCU, or something.
>
> A useful exercise here is to think about what we'd need to do parallel heap pruning. We don't need to go that far for v16 of course, but what's the simplest thing we can do to make that possible? Other use cases can change to more sophisticated schemes if need be.

For parallel heap pruning, multiple workers will insert key-value
pairs into the radix tree concurrently. The simplest solution would be
a single lock to protect writes, but the performance will not be good.
Another solution would be to divide the table into multiple ranges so
that the keys derived from TIDs don't conflict with each other, and to
have parallel workers each process one or more ranges. That way,
parallel vacuum workers can build *sub-trees* and the leader process
can merge them. In lazy vacuum use cases, since the write phase and
the read phase are separated, readers don't need to worry about
concurrent updates.
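
For example, assuming the iteration API from the 0002 patch
(signatures approximate), merging a worker's sub-tree into the
leader's tree could be as simple as:

static void
rt_merge_subtree(radix_tree *dst, radix_tree *src)
{
    rt_iter    *iter = rt_begin_iterate(src);
    uint64      key;
    uint64      value;

    /*
     * Each worker covered a disjoint block range, so the keys derived
     * from TIDs never collide across sub-trees and plain inserts
     * suffice; no conflict handling is needed.
     */
    while (rt_iterate_next(iter, &key, &value))
        rt_set(dst, key, value);

    rt_end_iterate(iter);
}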

I've attached a draft patch for lazy vacuum integration that can be
applied on top of the v8 patches. The patch adds a new module called
TIDStore, an efficient storage for TIDs backed by the radix tree. Lazy
vacuum and parallel vacuum use it instead of a TID array. The patch
also introduces rt_detach(), which was missing in the 0002 patch. It's
a very rough patch, but I hope it helps in considering the lazy vacuum
integration, the radix tree APIs, and the shared radix tree
functionality.
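
To give an idea of the shape of the interface, it looks roughly like
the following (simplified here; the exact names in the attached patch
may differ):

TidStore   *tidstore_create(size_t max_bytes, dsa_area *dsa);
void        tidstore_add_tids(TidStore *ts, BlockNumber blkno,
                              OffsetNumber *offsets, int num_offsets);
bool        tidstore_lookup_tid(TidStore *ts, ItemPointer tid);
void        tidstore_reset(TidStore *ts);
void        tidstore_destroy(TidStore *ts);

Passing dsa = NULL creates a local (non-shared) store; otherwise the
radix tree nodes are allocated in the given DSA area.
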
There are some TODOs:

* We need to reset the TIDStore and therefore reset the radix tree.
This can easily be done by using MemoryContextReset() in the
non-shared radix tree case, but in the shared case, we need either to
free all radix tree nodes recursively or to introduce a way to release
all allocated DSA memory (a rough sketch of the reset path follows
this list).

* We need to limit the size of the TIDStore (mainly the radix_tree) to
maintenance_work_mem.

* We need to change the counter-based information in
pg_stat_progress_vacuum such as max_dead_tuples and num_dead_tuples. I
think it would be better to show the maximum number of bytes we can
use for collecting TIDs, and the current usage, instead.
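
For the first TODO, here is a rough sketch of what the reset path
could look like (the names are hypothetical, and the shared branch is
the open question):

void
tidstore_reset(TidStore *ts)
{
    if (ts->dsa == NULL)
    {
        /* non-shared case: all radix tree nodes live in ts->context */
        MemoryContextReset(ts->context);
        ts->tree = rt_create(ts->context);
    }
    else
    {
        /*
         * Shared case: DSA has no bulk-release operation, so we would
         * either have to walk the tree and dsa_free() each node, or
         * add a way to release all memory allocated in the area.
         */
    }
}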

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v8-0005-PoC-lazy-vacuum-integration.patch application/octet-stream 35.3 KB
