
From: John Naylor <johncnaylorls(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2023-12-21 07:41:37
Message-ID: CANWCAZZcPgb05bRw1W7TQdisK6bo32gTvOL0YRztwh1ghq=QZg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 21, 2023 at 8:33 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> I found the following comment and wanted to discuss:
>
> // this might be better as "iterate over nodes", plus a callback to
> RT_DUMP_NODE,
> // which should really only concern itself with single nodes
> RT_SCOPE void
> RT_DUMP(RT_RADIX_TREE *tree)
>
> If it means we need to somehow use the iteration functions also for
> dumping the whole tree, it would probably need to refactor the
> iteration codes so that the RT_DUMP() can use them while dumping
> visited nodes. But we need to be careful of not adding overheads to
> the iteration performance.

Yeah, some months ago I thought a callback interface would make some
things easier. I don't think we need that at the moment (possibly
never), so that comment can just be removed. As for these debug
functions, FWIW, the only ones I found useful were the stats and
dumping a single node.

I've attached v47, which is v46 plus some fixes for radix tree.

0004 - moves everything for "delete" to the end -- gradually other
things will be grouped together in a sensible order

0005 - trivial

0006 - shrink nodes -- still needs testing, but nothing crashes yet.
This shows some renaming might be good: Previously we had
RT_CHUNK_CHILDREN_ARRAY_COPY for growing nodes, but for shrinking I've
added RT_COPY_ARRAYS_AND_DELETE, since the deletion happens by simply
not copying the slot to be deleted. Given that, when growing it would
be clearer to call the former RT_COPY_ARRAYS_FOR_INSERT, since that
reserves a new slot for the caller in the new node, but the caller
must do the insert itself. Note that there are some practical
restrictions/best-practices on whether shrinking should happen after
deletion or vice versa. Hopefully it's clear, but let me know if the
description can be improved. Also, it doesn't yet shrink from size
class 32 to 16, but it could with a bit of work.

0007 - trivial, but could use a better comment. I also need to make
sure stats reporting works (may also need some cleanup work).

0008 - fixes RT_FREE_RECURSE -- I believe you wondered some months ago
if DSA could just free all our allocated segments without throwing
away the DSA, and that's still a good question.

0009 - fixes the assert in RT_ITER_SET_NODE_FROM (btw, I don't think
this name is better than RT_UPDATE_ITER_STACK, so maybe we should go
back to that). The assert doesn't fire, so I guess it does what it's
supposed to? For me, the iteration logic is still the most confusing
piece out of the whole radix tree. Maybe that could be helped with
some better variable names, but I wonder if it needs more invasive
work. I confess I don't have better ideas for how it would work
differently.

0010 - some fixes for number of children accounting in node256

0011 - Long overdue pgindent of radixtree.h, without trying to fix up
afterwards. Feel free to throw out and redo if this interferes with
ongoing work.

The rest are from your v46. The bench doesn't work for tid store
anymore, so I squashed "disable bench for CI" until we get back to
that. Some more review comments (note: patch numbers are for v47, but
I changed nothing from v46 in this area):

0013:

+ * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value,
+ * and stored in the radix tree.

Recently outdated. The variable-length values seem to work, so let's
make everything match.

+#define MAX_TUPLES_PER_PAGE MaxOffsetNumber

Maybe we don't need this macro anymore? The name no longer fits, in any case.

+TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ char buf[MaxBlocktableEntrySize];
+ BlocktableEntry *page = (BlocktableEntry *) buf;

I'm not sure this is safe with alignment. Maybe rather than plain
"char", it needs to be a union with BlocktableEntry, or something.

+static inline BlocktableEntry *
+tidstore_iter_kv(TidStoreIter *iter, uint64 *key)
+{
+ if (TidStoreIsShared(iter->ts))
+ return shared_rt_iterate_next(iter->tree_iter.shared, key);
+
+ return local_rt_iterate_next(iter->tree_iter.local, key);
+}

In the old encoding scheme, this function did something important, but
now it's a useless wrapper with one caller.

+ /*
+ * In the shared case, TidStoreControl and radix_tree are backed by the
+ * same DSA area and rt_memory_usage() returns the value including both.
+ * So we don't need to add the size of TidStoreControl separately.
+ */
+ if (TidStoreIsShared(ts))
+ return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+
+ return sizeof(TidStore) + sizeof(TidStore) +
+ local_rt_memory_usage(ts->tree.local);

I don't see the point in including these tiny structs, since we will
always blow past the limit by a number of kilobytes (at least, often
megabytes or more) at the time it happens.

+ iter->output.max_offset = 64;

Maybe needs a comment that this is just some starting size and not
anything particular.

+ iter->output.offsets = palloc(sizeof(OffsetNumber) * iter->output.max_offset);

+ /* Make sure there is enough space to add offsets */
+ if (result->num_offsets + bmw_popcount(w) > result->max_offset)
+ {
+ result->max_offset *= 2;
+ result->offsets = repalloc(result->offsets,
+ sizeof(OffsetNumber) * result->max_offset);
+ }

popcount()-ing for every array element in every value is expensive --
let's just add sizeof(bitmapword). It's not that wasteful, but then
the initial max will need to be 128.

About separation of responsibilities for locking: The only thing
currently where the tid store is not locked is tree iteration. That's
a strange exception. Also, we've recently made RT_FIND return a
pointer, so the caller must somehow hold a share lock, but I think we
haven't exposed to callers the ability to do that, and we rely on the
tid store lock instead. We have a mix of tree locking and tid store
locking. We will need to consider carefully how to make this more
clear, maintainable, and understandable.

0015:

"XXX: some regression test fails since this commit changes the minimum
m_w_m to 2048 from 1024. This was necessary for the previous memory"

This shouldn't fail anymore if the "one-place" clamp was in a patch
before this. If so, let's take out that GUC change and worry about
min/max size separately. If it still fails, I'd like to know why.

- * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the
- * vacrel->dead_items array.
+ * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the TID store.

What I was getting at earlier is that the first line here doesn't
really need to change, we can just s/array/store/ ?

-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *deadoffsets, int num_offsets, Buffer buffer,
+ Buffer vmbuffer)

"buffer" should still come after "blkno", so that line doesn't need to change.

$ git diff master -- src/backend/access/heap/ | grep has_lpdead_items
- bool has_lpdead_items; /* includes existing LP_DEAD items */
- * pruning and freezing. all_visible implies !has_lpdead_items, but don't
- Assert(!prunestate.all_visible || !prunestate.has_lpdead_items);
- if (prunestate.has_lpdead_items)
- else if (prunestate.has_lpdead_items && PageIsAllVisible(page))
- if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming)
- prunestate->has_lpdead_items = false;
- prunestate->has_lpdead_items = true;

In a green field, it'd be fine to replace these with an expression of
"num_offsets", but it adds a bit of noise for reviewers and the git
log. Is it really necessary?

- deadoffsets[lpdead_items++] = offnum;
+ prunestate->deadoffsets[prunestate->num_offsets++] = offnum;

I'm also not quite sure why "deadoffsets" and "lpdead_items" got
moved to the PruneState. The latter was renamed in a way that makes
more sense, but I don't see why the churn is necessary.

@@ -1875,28 +1882,9 @@ lazy_scan_prune(LVRelState *vacrel,
}
#endif

- /*
- * Now save details of the LP_DEAD items from the page in vacrel
- */
- if (lpdead_items > 0)
+ if (prunestate->num_offsets > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
-
vacrel->lpdead_item_pages++;
- prunestate->has_lpdead_items = true;
-
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);

I don't understand why this block got removed and nothing new is
adding anything to the tid store.

@@ -1087,7 +1088,16 @@ lazy_scan_heap(LVRelState *vacrel)
 * with prunestate-driven visibility map and FSM steps (just like
 * the two-pass strategy).
 */
- Assert(dead_items->num_items == 0);
+ Assert(TidStoreNumTids(dead_items) == 0);
+ }
+ else if (prunestate.num_offsets > 0)
+ {
+ /* Save details of the LP_DEAD items from the page in dead_items */
+ TidStoreSetBlockOffsets(dead_items, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets);
+
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ TidStoreMemoryUsage(dead_items));

I guess it was added here, 800 lines away? If so, why?

About progress reporting: I want to make sure no one is going to miss
counting "num_dead_tuples". It's no longer relevant for the number of
index scans we need to do, but do admins still have a use for it?
Something to think about later.

0017

+ /*
+ * max_bytes is forced to be at least 64kB, the current minimum valid
+ * value for the work_mem GUC.
+ */
+ max_bytes = Max(64 * 1024L, max_bytes);

If this still needs to be here, I still don't understand why.

Attachment Content-Type Size
v47-ART.tar.gz application/gzip 83.2 KB
