Re: [PoC] Improve dead tuple storage for lazy vacuum

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2023-01-23 11:20:06
Message-ID: CAFBsxsEdQPnvgyJRpVS+cWxQpe-Uq23ou0G0ukfK9-_pCxszZA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
wrote:
>
> On Mon, Jan 16, 2023 at 2:02 PM John Naylor
> <john(dot)naylor(at)enterprisedb(dot)com> wrote:

In v21, all of your v20 improvements to the radix tree template and test
have been squashed into 0003, with one exception: v20-0010 (recursive
freeing of shared mem), which I've attached separately (for flexibility) as
v21-0006. I believe one of your earlier patches had a new DSA function for
freeing memory more quickly -- was there a problem with that approach? I
don't recall where that discussion went.

> + * XXX: Most functions in this file have two variants for inner nodes
and leaf
> + * nodes, therefore there are duplication codes. While this sometimes
makes the
> + * code maintenance tricky, this reduces branch prediction misses when
judging
> + * whether the node is a inner node of a leaf node.
>
> This comment seems to be out-of-date since we made it a template.

Done in 0020, along with a bunch of other comment editing.

> The following macros are defined but not undefined in radixtree.h:

Fixed in v21-0018.

Also:

0007 makes the value type configurable. Some debug functionality still
assumes integer type, but I think the rest is agnostic.
0010 turns node4 into node3, as discussed, going from 48 bytes to 32.
0012 adopts the benchmark module to the template, and adds meson support
(builds with warnings, but okay because not meant for commit).

The rest are cleanups, small refactorings, and more comment rewrites. I've
kept them separate for visibility. Next patch can squash them unless there
is any discussion.

> > uint32 is how we store the block number, so this too small and will
wrap around on overflow. int64 seems better.
>
> Agreed, will fix.

Great, but it's now uint64, not int64. All the large counters in struct
LVRelState, for example, are signed integers, as the usual practice.
Unsigned ints are "usually" for things like bit patterns and where explicit
wraparound is desired. There's probably more that can be done here to
change to signed types, but I think it's still a bit early to get to that
level of nitpicking. (Soon, I hope :-) )

> > + * We calculate the maximum bytes for the TidStore in different ways
> > + * for non-shared case and shared case. Please refer to the comment
> > + * TIDSTORE_MEMORY_DEDUCT for details.
> > + */
> >
> > Maybe the #define and comment should be close to here.
>
> Will fix.

For this, I intended that "here" meant "in or just above the function".

+#define TIDSTORE_LOCAL_MAX_MEMORY_DEDUCT (1024L * 70) /* 70kB */
+#define TIDSTORE_SHARED_MAX_MEMORY_RATIO_PO2 (float) 0.75
+#define TIDSTORE_SHARED_MAX_MEMORY_RATIO (float) 0.6

These symbols are used only once, in tidstore_create(), and are difficult
to read. That function has few comments. The symbols have several
paragraphs, but they are far away. It might be better for readability to
just hard-code numbers in the function, with the explanation about the
numbers near where they are used.

> > + * Destroy a TidStore, returning all memory. The caller must be
certain that
> > + * no other backend will attempt to access the TidStore before calling
this
> > + * function. Other backend must explicitly call tidstore_detach to
free up
> > + * backend-local memory associated with the TidStore. The backend that
calls
> > + * tidstore_destroy must not call tidstore_detach.
> > + */
> > +void
> > +tidstore_destroy(TidStore *ts)
> >
> > If not addressed by next patch, need to phrase comment with FIXME or
TODO about making certain.
>
> Will fix.

Did anything change here? There is also this, in the template, which I'm
not sure has been addressed:

* XXX: Currently we allow only one process to do iteration. Therefore,
rt_node_iter
* has the local pointers to nodes, rather than RT_PTR_ALLOC.
* We need either a safeguard to disallow other processes to begin the
iteration
* while one process is doing or to allow multiple processes to do the
iteration.

> > This part only runs "if (vacrel->nindexes == 0)", so seems like
unneeded complexity. It arises because lazy_scan_prune() populates the tid
store even if no index vacuuming happens. Perhaps the caller of
lazy_scan_prune() could pass the deadoffsets array, and upon returning,
either populate the store or call lazy_vacuum_heap_page(), as needed. It's
quite possible I'm missing some detail, so some description of the design
choices made would be helpful.
>
> I agree that we don't need complexity here. I'll try this idea.

Keeping the offsets array in the prunestate seems to work out well.

Some other quick comments on tid store and vacuum, not comprehensive. Let
me know if I've misunderstood something:

TID store:

+ * XXXXXXXX XXXYYYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYuuuu
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit

I was confused for a while, and I realized the bits are in reverse order
from how they are usually pictured (high on left, low on the right).

+ * 11 bits enough for the offset number, because MaxHeapTuplesPerPage <
2^11
+ * on all supported block sizes (TIDSTORE_OFFSET_NBITS). We are frugal with

+ * XXX: if we want to support non-heap table AM that want to use the full
+ * range of possible offset numbers, we'll need to reconsider
+ * TIDSTORE_OFFSET_NBITS value.

Would it be worth it (or possible) to calculate constants based on
compile-time block size? And/or have a fallback for other table AMs? Since
this file is in access/common, the intention is to allow general-purpose, I
imagine.

+typedef dsa_pointer tidstore_handle;

It's not clear why we need a typedef here, since here:

+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
...
+ control = handle;

...there is a differently-named dsa_pointer variable that just gets the
function parameter.

+/* Return the maximum memory TidStore can use */
+uint64
+tidstore_max_memory(TidStore *ts)

size_t is more suitable for memory.

+ /*
+ * Since the shared radix tree supports concurrent insert,
+ * we don't need to acquire the lock.
+ */

Hmm? IIUC, the caller only acquires the lock after returning from here, to
update statistics. Why is it safe to insert with no lock? Am I missing
something?

VACUUM integration:

-#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
+#define PARALLEL_VACUUM_KEY_DSA 2

Seems like unnecessary churn? It is still all about dead items, after all.
I understand using "DSA" for the LWLock, since that matches surrounding
code.

+#define HAS_LPDEAD_ITEMS(state) (((state).lpdead_items) > 0)

This macro helps the patch readability in some places, but I'm not sure it
helps readability of the file as a whole. The following is in the patch and
seems perfectly clear without the macro:

- if (lpdead_items > 0)
+ if (prunestate->lpdead_items > 0)

About shared memory: I have some mild reservations about the naming of the
"control object", which may be in shared memory. Is that an established
term? (If so, disregard the rest): It seems backwards -- the thing in
shared memory is the actual tree itself. The thing in backend-local memory
has the "handle", and that's how we control the tree. I don't have a better
naming scheme, though, and might not be that important. (Added a WIP
comment)

Now might be a good time to look at earlier XXX comments and come up with a
plan to address them.

That's all I have for now.

--
John Naylor
EDB: http://www.enterprisedb.com

Attachment Content-Type Size
v21-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch text/x-patch 6.1 KB
v21-0001-introduce-vector8_min-and-vector8_highbit_mask.patch text/x-patch 2.7 KB
v21-0005-Restore-RT_GROW_NODE_KIND.patch text/x-patch 7.7 KB
v21-0004-Clean-up-some-nomenclature-around-node-insertion.patch text/x-patch 9.0 KB
v21-0003-Add-radixtree-template.patch text/x-patch 109.9 KB
v21-0006-Free-all-radix-tree-nodes-recursively.patch text/x-patch 3.2 KB
v21-0009-Remove-hard-coded-128.patch text/x-patch 5.1 KB
v21-0008-Streamline-calculation-of-slab-blocksize.patch text/x-patch 4.9 KB
v21-0010-Reduce-node4-to-node3.patch text/x-patch 21.5 KB
v21-0007-Make-value-type-configurable.patch text/x-patch 19.4 KB
v21-0012-Tool-for-measuring-radix-tree-performance.patch text/x-patch 22.0 KB
v21-0013-Get-rid-of-NODE_IS_EMPTY-macro.patch text/x-patch 1.6 KB
v21-0015-Get-rid-of-FIXED_NODE_HAS_FREE_SLOT.patch text/x-patch 1.9 KB
v21-0014-Add-some-comments-for-insert-logic.patch text/x-patch 3.5 KB
v21-0011-Expand-commentary-for-kinds-vs.-size-classes.patch text/x-patch 4.6 KB
v21-0019-Standardize-on-testing-for-is-leaf.patch text/x-patch 6.3 KB
v21-0016-s-VAR_NODE_HAS_FREE_SLOT-RT_NODE_MUST_GROW.patch text/x-patch 2.2 KB
v21-0017-Remove-some-maintenance-hazards-in-growing-nodes.patch text/x-patch 9.4 KB
v21-0018-Clean-up-symbols.patch text/x-patch 12.9 KB
v21-0020-Do-some-rewriting-and-proofreading-of-comments.patch text/x-patch 14.5 KB
v21-0021-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch text/x-patch 33.1 KB
v21-0022-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch text/x-patch 39.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Gustafsson 2023-01-23 11:42:57 Re: TAP output format in pg_regress
Previous Message Dilip Kumar 2023-01-23 11:17:09 Re: New strategies for freezing, advancing relfrozenxid early