Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2023-01-16 08:18:02
Message-ID: CAD21AoALOcuJVgwn+5EFSQccMh_hQAfBtiC5hLDMgpRwOxY4rA@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jan 16, 2023 at 2:02 PM John Naylor
<john(dot)naylor(at)enterprisedb(dot)com> wrote:
>
> On Thu, Jan 12, 2023 at 9:51 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Thu, Jan 12, 2023 at 5:21 PM John Naylor
> > <john(dot)naylor(at)enterprisedb(dot)com> wrote:
> > >
> > > Okay, I'll squash the previous patch and work on cleaning up the internals. I'll keep the external APIs the same so that your work on vacuum integration can be easily rebased on top of that, and we can work independently.
>
> There were some conflicts with HEAD, so to keep the CF bot busy, I've quickly put together v18. I still have a lot of cleanup work to do, but this is enough for now.

Thanks! cfbot complains about some warnings, but these are expected
(due to unused delete routines etc). However, one reported error[1]
might be related to the 0002 patch?

[05:44:11.759] "link" /MACHINE:x64
/OUT:src/test/modules/test_radixtree/test_radixtree.dll
src/test/modules/test_radixtree/test_radixtree.dll.p/win32ver.res
src/test/modules/test_radixtree/test_radixtree.dll.p/test_radixtree.c.obj
"/nologo" "/release" "/nologo" "/DEBUG"
"/PDB:src/test\modules\test_radixtree\test_radixtree.pdb" "/DLL"
"/IMPLIB:src/test\modules\test_radixtree\test_radixtree.lib"
"/INCREMENTAL:NO" "/STACK:4194304" "/NOEXP" "/DEBUG:FASTLINK"
"/NOIMPLIB" "C:/cirrus/build/src/backend/postgres.exe.lib"
"wldap32.lib" "c:/openssl/1.1/lib/libssl.lib"
"c:/openssl/1.1/lib/libcrypto.lib" "ws2_32.lib" "kernel32.lib"
"user32.lib" "gdi32.lib" "winspool.lib" "shell32.lib" "ole32.lib"
"oleaut32.lib" "uuid.lib" "comdlg32.lib" "advapi32.lib"
[05:44:11.819] test_radixtree.c.obj : error LNK2001: unresolved
external symbol pg_popcount64
[05:44:11.819] src\test\modules\test_radixtree\test_radixtree.dll :
fatal error LNK1120: 1 unresolved externals

> 0003 contains all v17 local-memory coding squashed together.

+ * XXX: Most functions in this file have two variants for inner nodes and leaf
+ * nodes, therefore there are duplication codes. While this sometimes makes the
+ * code maintenance tricky, this reduces branch prediction misses when judging
+ * whether the node is a inner node of a leaf node.

This comment seems to be out-of-date since we made it a template.

---
+#ifndef RT_COMMON
+#define RT_COMMON

What are we using this macro RT_COMMON for?

---
The following macros are defined but never #undef'ed at the end of
radixtree.h (a sketch of the usual cleanup block follows the list):

RT_MAKE_PREFIX
RT_MAKE_NAME
RT_MAKE_NAME_
RT_SEARCH
UINT64_FORMAT_HEX
RT_NODE_SPAN
RT_NODE_MAX_SLOTS
RT_CHUNK_MASK
RT_MAX_SHIFT
RT_MAX_LEVEL
RT_NODE_125_INVALID_IDX
RT_GET_KEY_CHUNK
BM_IDX
BM_BIT
RT_NODE_KIND_4
RT_NODE_KIND_32
RT_NODE_KIND_125
RT_NODE_KIND_256
RT_NODE_KIND_COUNT
RT_PTR_LOCAL
RT_PTR_ALLOC
RT_INVALID_PTR_ALLOC
NODE_SLAB_BLOCK_SIZE
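
As simplehash.h does, we could finish the header with an #undef
cleanup block covering these. A rough sketch (only part of the list
shown):

/* at the end of radixtree.h */
#undef RT_MAKE_PREFIX
#undef RT_MAKE_NAME
#undef RT_MAKE_NAME_
#undef RT_SEARCH
#undef RT_NODE_SPAN
#undef RT_NODE_MAX_SLOTS
/* ... and so on for the rest of the macros listed above ... */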

> 0004 perf test not updated but it doesn't build by default so it's fine for now

Okay.

> 0005 removes node.chunk as discussed, but does not change node4 fanout yet.

LGTM.

> 0006 is a small cleanup regarding setting node fanout.

LGTM.

> 0007 squashes my shared memory work with Masahiko's fixes from the addendum v17-0010.

+ /* XXX: do we need to set a callback on exit to detach dsa? */

In the current shared radix tree design, the caller is responsible
for creating (or attaching to) a DSA area and passing it to
RT_CREATE() or RT_ATTACH(). This lets one DSA area hold not only the
radix tree but also other data, which is more flexible. Since the
caller has to detach from the DSA area somehow anyway, I don't think
we need to set a callback here for that.
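
To illustrate, here is a rough usage sketch. The rt_* names below
stand for the expanded RT_CREATE/RT_ATTACH/RT_DETACH template
functions, and the handle-passing details are simplified:

/* creating backend: it owns the DSA area, the tree only borrows it */
static rt_radix_tree *
create_shared_tree(int tranche_id)
{
    dsa_area   *area = dsa_create(tranche_id);

    return rt_create(CurrentMemoryContext, area);
}

/* attaching backend: detaching from the DSA area is its own job */
static void
attach_use_detach(dsa_handle area_handle, rt_handle tree_handle)
{
    dsa_area   *area = dsa_attach(area_handle);
    rt_radix_tree *tree = rt_attach(area, tree_handle);

    /* ... use the tree ... */

    rt_detach(tree);
    dsa_detach(area);       /* the caller, not the radix tree, detaches */
}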

---
+ dsa_free(tree->dsa, tree->ctl->handle); // XXX
+ //dsa_detach(tree->dsa);

Similar to above, I think we should not detach from the DSA area here.

Given that the DSA area used by the radix tree could also hold other
data, I think that in RT_FREE() we need to free each radix tree node
allocated in the DSA area. In lazy vacuum, we check the memory usage
instead of the number of TIDs, and we need to reset the TidStore
after an index scan, so it does RT_FREE() and dsa_trim() to return
DSM segments to the OS. I implemented rt_free_recurse() for this
purpose in the v15 patch.
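
The shape I had in mind was roughly the following. This is only a
simplified sketch: rt_pointer_to_local() and rt_node_get_children()
are stand-ins for whatever helpers the template ends up with, and the
per-node-kind iteration is glossed over:

/* recursively free all nodes reachable from "ptr", then "ptr" itself */
static void
rt_free_recurse(rt_radix_tree *tree, rt_pointer ptr, int shift)
{
    rt_node    *node = rt_pointer_to_local(tree, ptr);

    if (shift > 0)
    {
        /* inner node: free the subtrees first */
        int         nchildren;
        rt_pointer *children = rt_node_get_children(node, &nchildren);

        for (int i = 0; i < nchildren; i++)
            rt_free_recurse(tree, children[i], shift - RT_NODE_SPAN);
    }

    dsa_free(tree->dsa, ptr);
}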

--
- Assert(tree->root);
+ //Assert(tree->ctl->root);

I think we don't need this assertion in the first place. We check it
at the beginning of the function.

---

+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(NODE_IS_LEAF(node));
+#else
+ Assert(!NODE_IS_LEAF(node));
+#endif
+

I think we can move this change to the 0003 patch.

> 0008 turns the existence checks in RT_NODE_UPDATE_INNER into Asserts, as discussed.

LGTM.

>
> 0009/0010 are just copies of Masahiko's v17 addendum v17-0011/12, but the latter rebased over recent variable renaming (it's possible I missed something, so worth checking).
>
> > I've implemented the idea of using union. Let me share WIP code for
> > discussion, I've attached three patches that can be applied on top of
>
> Seems fine as far as the union goes. Let's go ahead with this, and make progress on locking etc.

+1
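
For reference, the union I implemented is roughly shaped like this (a
simplified sketch from memory; the field names may differ in the
posted patch):

typedef struct TidStore
{
    /* memory context and max_bytes/num_tids bookkeeping elided */
    bool        shared;             /* which union member is valid */
    union
    {
        local_rt_radix_tree *local;     /* backend-local radix tree */
        shared_rt_radix_tree *shared;   /* radix tree living in a DSA area */
    }           tree;
    dsa_area   *area;               /* used only in the shared case */
} TidStore;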

>
> > Overall, TidStore implementation with the union idea doesn't look so
> > ugly to me. But I got many compiler warning about unused radix tree
> > functions like:
> >
> > tidstore.c:99:19: warning: 'shared_rt_delete' defined but not used
> > [-Wunused-function]
> >
> > I'm not sure there is a convenient way to suppress this warning but
> > one idea is to have some macros to specify what operations are
> > enabled/declared.
>
> That sounds like a good idea. It's also worth wondering if we even need RT_NUM_ENTRIES at all, since the caller is capable of keeping track of that if necessary. It's also misnamed, since it's concerned with the number of keys. The vacuum case cares about the number of TIDs, and not number of (encoded) keys. Even if we ever (say) changed the key to blocknumber and value to Bitmapset, the number of keys might not be interesting.

Right. In fact, TidStore doesn't use RT_NUM_ENTRIES.

> It sounds like we should at least make the delete functionality optional. (Side note on optional functions: if an implementation didn't care about iteration or its order, we could optimize insertion into linear nodes)

Agreed.
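
Something along these lines in the template header would do it
(RT_USE_DELETE and RT_SCOPE here are only made-up placeholder names
for the opt-in switch):

#ifdef RT_USE_DELETE
RT_SCOPE bool
RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
{
    /* ... existing delete implementation ... */
}
#endif

Then tidstore.c would simply not define RT_USE_DELETE before
including radixtree.h, so no unused shared_rt_delete() is emitted in
the first place.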

>
> Since this is WIP, you may already have some polish in mind, so I won't go over the patches in detail, but I wanted to ask about a few things (numbers referring to v17 addendum, not v18):
>
> 0011
>
> + * 'num_tids' is the number of Tids stored so far. 'max_byte' is the maximum
> + * bytes a TidStore can use. These two fields are commonly used in both
> + * non-shared case and shared case.
> + */
> + uint32 num_tids;
>
> uint32 is how we store the block number, so this too small and will wrap around on overflow. int64 seems better.

Agreed, will fix.
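
For the record, the rough worst case with the usual 8kB heap pages:

/* up to MaxHeapTuplesPerPage (291) dead items on each of ~2^32 blocks */
uint64      max_dead_tids = (uint64) MaxBlockNumber * MaxHeapTuplesPerPage;
/* ~1.25e12, far beyond PG_UINT32_MAX (~4.3e9), hence int64 */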

>
> + * We calculate the maximum bytes for the TidStore in different ways
> + * for non-shared case and shared case. Please refer to the comment
> + * TIDSTORE_MEMORY_DEDUCT for details.
> + */
>
> Maybe the #define and comment should be close to here.

Will fix.

>
> + * Destroy a TidStore, returning all memory. The caller must be certain that
> + * no other backend will attempt to access the TidStore before calling this
> + * function. Other backend must explicitly call tidstore_detach to free up
> + * backend-local memory associated with the TidStore. The backend that calls
> + * tidstore_destroy must not call tidstore_detach.
> + */
> +void
> +tidstore_destroy(TidStore *ts)
>
> If not addressed by next patch, need to phrase comment with FIXME or TODO about making certain.

Will fix.

>
> + * Add Tids on a block to TidStore. The caller must ensure the offset numbers
> + * in 'offsets' are ordered in ascending order.
>
> Must? What happens otherwise?

It ends up missing TIDs because the same key gets overwritten with
different values. Would it be better to have a bool argument, say
need_sort, to sort the given array if the caller wants?
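
A strawman sketch of what I mean (the signature is approximate;
alternatively we could sort a local copy to avoid scribbling on the
caller's array):

static int
offset_cmp(const void *a, const void *b)
{
    OffsetNumber off_a = *(const OffsetNumber *) a;
    OffsetNumber off_b = *(const OffsetNumber *) b;

    return (off_a > off_b) - (off_a < off_b);
}

void
tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
                  int num_offsets, bool need_sort)
{
    if (need_sort)
        qsort(offsets, num_offsets, sizeof(OffsetNumber), offset_cmp);

    /* ... existing encode-and-insert logic ... */
}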

>
> + uint64 last_key = PG_UINT64_MAX;
>
> I'm having some difficulty understanding this sentinel and how it's used.

Will improve the logic.

>
> @@ -1039,11 +1040,18 @@ lazy_scan_heap(LVRelState *vacrel)
> if (prunestate.has_lpdead_items)
> {
> Size freespace;
> + TidStoreIter *iter;
> + TidStoreIterResult *result;
>
> - lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
> + iter = tidstore_begin_iterate(vacrel->dead_items);
> + result = tidstore_iterate_next(iter);
> + lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
> + buf, &vmbuffer);
> + Assert(!tidstore_iterate_next(iter));
> + tidstore_end_iterate(iter);
>
> /* Forget the LP_DEAD items that we just vacuumed */
> - dead_items->num_items = 0;
> + tidstore_reset(dead_items);
>
> This part only runs "if (vacrel->nindexes == 0)", so seems like unneeded complexity. It arises because lazy_scan_prune() populates the tid store even if no index vacuuming happens. Perhaps the caller of lazy_scan_prune() could pass the deadoffsets array, and upon returning, either populate the store or call lazy_vacuum_heap_page(), as needed. It's quite possible I'm missing some detail, so some description of the design choices made would be helpful.

I agree that we don't need the complexity here. I'll try this idea.
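
IIUC the resulting flow in lazy_scan_heap() would look something like
this (names and signatures approximate):

/* lazy_scan_prune() fills a caller-supplied deadoffsets array */
lazy_scan_prune(vacrel, buf, blkno, page, &prunestate,
                deadoffsets, &num_offsets);

if (prunestate.has_lpdead_items)
{
    if (vacrel->nindexes == 0)
    {
        /* no indexes, so vacuum the page right away */
        lazy_vacuum_heap_page(vacrel, blkno, deadoffsets, num_offsets,
                              buf, &vmbuffer);
    }
    else
    {
        /* remember the dead items for the index-vacuum phase */
        tidstore_add_tids(vacrel->dead_items, blkno,
                          deadoffsets, num_offsets);
    }
}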

>
> On Mon, Jan 16, 2023 at 9:53 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> > I've written a simple script to simulate the DSA memory usage and the
> > limit. The 75% limit works fine for a power of two cases, and we can
> > use the 60% limit for other cases (it seems we can use up to about 66%
> > but used 60% for safety). It would be best if we can mathematically
> > prove it but I could prove only the power of two cases. But the script
> > practically shows the 60% threshold would work for these cases.
>
> Okay. It's worth highlighting this in the comments, and also the fact that it depends on internal details of how DSA increases segment size.

Agreed.
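
For the archives, the gist of the simulation as a standalone toy
model (the real dsa.c growth rule is a bit more involved than plain
doubling, which is exactly why this deserves a comment):

#include <stdio.h>
#include <stdint.h>

/*
 * Toy model: DSA starts with a 1MB segment and later segments grow
 * roughly by doubling, so total allocated memory jumps in big steps.
 * How much of a byte limit we can actually use before the next step
 * overshoots it depends on where those steps land.
 */
int
main(void)
{
    uint64_t    limit = 1024ULL * 1024 * 1024;  /* e.g. 1GB */
    uint64_t    total = 0;
    uint64_t    seg = 1024 * 1024;              /* 1MB initial segment */

    while (total + seg <= limit)
    {
        total += seg;
        printf("total = %4lluMB (%.0f%% of limit)\n",
               (unsigned long long) (total / (1024 * 1024)),
               100.0 * (double) total / (double) limit);
        seg *= 2;
    }
    return 0;
}

With a non-power-of-two limit such as 1.5GB, this stops at about 66%
of the limit, which is where the "about 66%, use 60% for safety"
number above comes from.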

Since it seems you're working on another round of cleanup, I can
address the above comments after your work is completed. But I'm also
fine with them being included in your cleanup work.

Regards,

[1] https://cirrus-ci.com/task/5078505327689728

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
