Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2023-01-12 05:44:14
Message-ID: CAD21AoA_VjvG6LDYvJyuezNPtH12bx0My5Ws88o1Gu4h9qX5GQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jan 11, 2023 at 12:13 PM John Naylor
<john(dot)naylor(at)enterprisedb(dot)com> wrote:
>
> On Tue, Jan 10, 2023 at 7:08 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> > It looks no problem in terms of vacuum integration, although I've not
> > fully tested yet. TID store uses the radix tree as the main storage,
> > and with the template radix tree, the data types for shared and
> > non-shared will be different. TID store can have an union for the
> > radix tree and the structure would be like follows:
>
> > /* Storage for Tids */
> > union tree
> > {
> > local_radix_tree *local;
> > shared_radix_tree *shared;
> > };
>
> We could possibly go back to using a common data type for this, but with unused fields in each setting, as before. We would have to be more careful of things like the 32-bit crash from a few weeks ago.

One idea to have a common data type without unused fields is to use
radix_tree a base class. We cast it to radix_tree_shared or
radix_tree_local depending on the flag is_shared in radix_tree. For
instance we have like (based on non-template version),

struct radix_tree
{
bool is_shared;
MemoryContext context;
};

typedef struct rt_shared
{
rt_handle handle;
uint32 magic;

/* Root node */
dsa_pointer root;

uint64 max_val;
uint64 num_keys;

/* need a lwlock */

/* statistics */
#ifdef RT_DEBUG
int32 cnt[RT_SIZE_CLASS_COUNT];
#endif
} rt_shared;

struct radix_tree_shared
{
radix_tree rt;

rt_shared *shared;
dsa_area *area;
} radix_tree_shared;

struct radix_tree_local
{
radix_tree rt;

uint64 max_val;
uint64 num_keys;

rt_node *root;

/* used only when the radix tree is private */
MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];

/* statistics */
#ifdef RT_DEBUG
int32 cnt[RT_SIZE_CLASS_COUNT];
#endif
} radix_tree_local;

>
> > In the functions of TID store, we need to call either local or shared
> > radix tree functions depending on whether TID store is shared or not.
> > We need if-branch for each key-value pair insertion, but I think it
> > would not be a big performance problem in TID store use cases, since
> > vacuum is an I/O intensive operation in many cases.
>
> Also, the branch will be easily predicted. That was still true in earlier patches, but with many more branches and fatter code paths.
>
> > Overall, I think
> > there is no problem and I'll investigate it in depth.
>
> Okay, great. If the separate-functions approach turns out to be ugly, we can always go back to the branching approach for shared memory. I think we'll want to keep this as a template overall, at least to allow different value types and to ease adding variable-length keys if someone finds a need.

I agree to keep this as a template. From the vacuum integration
perspective, it would be better if we can use a common data type for
shared and local. It makes sense to have different data types if the
radix trees have different values types.

>
> > Apart from that, I've been considering the lock support for shared
> > radix tree. As we discussed before, the current usage (i.e, only
> > parallel index vacuum) doesn't require locking support at all, so it
> > would be enough to have a single lock for simplicity.
>
> Right, that should be enough for PG16.
>
> > If we want to
> > use the shared radix tree for other use cases such as the parallel
> > heap vacuum or the replacement of the hash table for shared buffers,
> > we would need better lock support.
>
> For future parallel pruning, I still think a global lock is "probably" fine if the workers buffer in local arrays. Highly concurrent applications will need additional work, of course.
>
> > For example, if we want to support
> > Optimistic Lock Coupling[1],
>
> Interesting, from the same authors!

+1

>
> > we would need to change not only the node
> > structure but also the logic. Which probably leads to widen the gap
> > between the code for non-shared and shared radix tree. In this case,
> > once we have a better radix tree optimized for shared case, perhaps we
> > can replace the templated shared radix tree with it. I'd like to hear
> > your opinion on this line.
>
> I'm not in a position to speculate on how best to do scalable concurrency, much less how it should coexist with the local implementation. It's interesting that their "ROWEX" scheme gives up maintaining order in the linear nodes.

>
> > > One review point I'll mention: Somehow I didn't notice there is no use for the "chunk" field in the rt_node type -- it's only set to zero and copied when growing. What is the purpose? Removing it would allow the smallest node to take up only 32 bytes with a fanout of 3, by eliminating padding.
> >
> > Oh, I didn't notice that. The chunk field was originally used when
> > redirecting the child pointer in the parent node from old to new
> > (grown) node. When redirecting the pointer, since the corresponding
> > chunk surely exists on the parent we can skip existence checks.
> > Currently we use RT_NODE_UPDATE_INNER() for that (see
> > RT_REPLACE_NODE()) but having a dedicated function to update the
> > existing chunk and child pointer might improve the performance. Or
> > reducing the node size by getting rid of the chunk field might be
> > better.
>
> I see. IIUC from a brief re-reading of the code, saving that chunk would only save us from re-loading "parent->shift" from L1 cache and shifting the key. The cycles spent doing that seem small compared to the rest of the work involved in growing a node. Expressions like "if (idx < 0) return false;" return to an asserts-only variable, so in production builds, I would hope that branch gets elided (I haven't checked).
>
> I'm quite keen on making the smallest node padding-free, (since we don't yet have path compression or lazy path expansion), and this seems the way to get there.

Okay, let's get rid of that in the v18.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2023-01-12 05:53:07 Re: Add a new pg_walinspect function to extract FPIs from WAL records
Previous Message Brar Piening 2023-01-12 05:34:44 Re: pgsql: Doc: add XML ID attributes to <sectN> and <varlistentry> tags.