Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <johncnaylorls(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2024-01-31 05:49:21
Message-ID: CAD21AoBAROsGaRn846j6KfHO3_0xRsb3a_LxLwCjWJG8Hjw9wg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jan 30, 2024 at 7:20 PM John Naylor <johncnaylorls(at)gmail(dot)com> wrote:
>
> On Tue, Jan 30, 2024 at 7:56 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Mon, Jan 29, 2024 at 8:48 PM John Naylor <johncnaylorls(at)gmail(dot)com> wrote:
>
> > > I meant the macro could probably be
> > >
> > > Max(SLAB_DEFAULT_BLOCK_SIZE, (size) * N)
> > >
> > > (Right now N=32). I also realize I didn't answer your question earlier
> > > about block sizes being powers of two. I was talking about PG in
> > > general -- I was thinking all block sizes were powers of two. If
> > > that's true, I'm not sure if it's because programmers find the macro
> > > calculations easy to reason about, or if there was an implementation
> > > reason for it (e.g. libc behavior). 32*2088 bytes is about 65kB, or
> > > just above a power of two, so if we did round that up it would be
> > > 128kB.
> >
> > Thank you for your explanation. It might be better to follow other
> > codes. Does the calculation below make sense to you?
> >
> > RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
> > Size inner_blocksize = SLAB_DEFAULT_BLOCK_SIZE;
> > while (inner_blocksize < 32 * size_class.allocsize)
> > inner_blocksize <<= 1;
>
> It does make sense, but we can do it more simply:
>
> Max(SLAB_DEFAULT_BLOCK_SIZE, pg_nextpower2_32(size * 32))

Thanks!

I've attached the new patch set (v56). I've squashed previous updates
and addressed review comments on v55 in separate patches. Here are the
update summary:

0004: fix compiler warning caught by ci test.
0005-0008: address review comments on radix tree codes.
0009: cleanup #define and #undef
0010: use TEST_SHARED_RT macro for shared radix tree test. RT_SHMEM is
undefined after including radixtree.h so we should not use it in test
code.
0013-0015: address review comments on tidstore codes.
0017-0018: address review comments on vacuum integration codes.

Looking at overall changes, there are still XXX and TODO comments in
radixtree.h:

---
* XXX There are 4 node kinds, and this should never be increased,
* for several reasons:
* 1. With 5 or more kinds, gcc tends to use a jump table for switch
* statements.
* 2. The 4 kinds can be represented with 2 bits, so we have the option
* in the future to tag the node pointer with the kind, even on
* platforms with 32-bit pointers. This might speed up node traversal
* in trees with highly random node kinds.
* 3. We can have multiple size classes per node kind.

Can we just remove "XXX"?

---
* WIP: notes about traditional radix tree trading off span vs height...

Are you going to write it?

---
#ifdef RT_SHMEM
/* WIP: do we really need this? */
typedef dsa_pointer RT_HANDLE;
#endif

I think it's worth having it.

---
* WIP: The paper uses at most 64 for this node kind. "isset" happens to fit
* inside a single bitmapword on most platforms, so it's a good starting
* point. We can make it higher if we need to.
*/
#define RT_FANOUT_48_MAX (RT_NODE_MAX_SLOTS / 4)

Are you going to work something on this?

---
/* WIP: We could go first to the higher node16 size class */
newnode = RT_ALLOC_NODE(tree, RT_NODE_KIND_16, RT_CLASS_16_LO);

Does it mean to go to RT_CLASS_16_HI and then further go to
RT_CLASS_16_LO upon further deletion?

---
* TODO: The current locking mechanism is not optimized for high concurrency
* with mixed read-write workloads. In the future it might be worthwhile
* to replace it with the Optimistic Lock Coupling or ROWEX mentioned in
* the paper "The ART of Practical Synchronization" by the same authors as
* the ART paper, 2016.

I think it's not TODO for now, but a future improvement. We can remove it.

---
/* TODO: consider 5 with subclass 1 or 2. */
#define RT_FANOUT_4 4

Is there something we need to do here?

---
/*
* Return index of the chunk and slot arrays for inserting into the node,
* such that the chunk array remains ordered.
* TODO: Improve performance for non-SIMD platforms.
*/

Are you going to work on this?

---
/* Delete the element at 'idx' */
/* TODO: replace slow memmove's */

Are you going to work on this?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v56-ART.tar.gz application/x-gzip 64.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message vignesh C 2024-01-31 05:51:03 Re: src/bin/pg_upgrade/t/004_subscription.pl test comment fix
Previous Message David G. Johnston 2024-01-31 05:43:57 Re: Possibility to disable `ALTER SYSTEM`