Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <johncnaylorls(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2023-12-26 05:42:24
Message-ID: CAD21AoDC7CTSgVxBm=3pktVymv2pK8UtoM5KCa_KXsy6g1npfg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 21, 2023 at 4:41 PM John Naylor <johncnaylorls(at)gmail(dot)com> wrote:
>
> On Thu, Dec 21, 2023 at 8:33 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > I found the following comment and wanted to discuss:
> >
> > // this might be better as "iterate over nodes", plus a callback to
> > RT_DUMP_NODE,
> > // which should really only concern itself with single nodes
> > RT_SCOPE void
> > RT_DUMP(RT_RADIX_TREE *tree)
> >
> > If it means we need to somehow use the iteration functions for
> > dumping the whole tree as well, we would probably need to refactor
> > the iteration code so that RT_DUMP() can use it while dumping
> > visited nodes. But we need to be careful not to add overhead to
> > iteration performance.
>
> Yeah, some months ago I thought a callback interface would make some
> things easier. I don't think we need that at the moment (possibly
> never), so that comment can be just removed. As far as these debug
> functions, I only found useful the stats and dumping a single node,
> FWIW.
>
> I've attached v47, which is v46 plus some fixes for radix tree.
>
> 0004 - moves everything for "delete" to the end -- gradually other
> things will be grouped together in a sensible order
>
> 0005 - trivial

LGTM.

>
> 0006 - shrink nodes -- still needs testing, but nothing crashes yet.

Cool. The coverage test results showed that the shrink code paths are also covered.

> This shows some renaming might be good: Previously we had
> RT_CHUNK_CHILDREN_ARRAY_COPY for growing nodes, but for shrinking I've
> added RT_COPY_ARRAYS_AND_DELETE, since the deletion happens by simply
> not copying the slot to be deleted. This means when growing it would
> be more clear to call the former RT_COPY_ARRAYS_FOR_INSERT, since that
> reserves a new slot for the caller in the new node, but the caller
> must do the insert itself.

Agreed.
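For illustration, the two copy semantics described above can be sketched with simplified byte arrays (the names and layouts here are stand-ins, not the actual radixtree.h definitions):

```c
#include <stdint.h>
#include <string.h>

typedef uint8_t chunk_t;

/* Growing: copy all slots, leaving a gap at insert_pos that the
 * caller fills in afterwards (hence "FOR_INSERT"). */
static void
copy_arrays_for_insert(chunk_t *dst, const chunk_t *src, int count,
                       int insert_pos)
{
    memcpy(dst, src, insert_pos);
    memcpy(dst + insert_pos + 1, src + insert_pos, count - insert_pos);
}

/* Shrinking: the deletion happens by simply not copying the slot at
 * delete_pos (hence "AND_DELETE"). */
static void
copy_arrays_and_delete(chunk_t *dst, const chunk_t *src, int count,
                       int delete_pos)
{
    memcpy(dst, src, delete_pos);
    memcpy(dst + delete_pos, src + delete_pos + 1, count - delete_pos - 1);
}

static int
copy_demo(void)
{
    chunk_t     src[4] = {1, 2, 4, 5};
    chunk_t     grown[5];
    chunk_t     shrunk[4];

    copy_arrays_for_insert(grown, src, 4, 2);
    grown[2] = 3;               /* caller performs the actual insert */
    copy_arrays_and_delete(shrunk, grown, 5, 2);
    return grown[2] == 3 && shrunk[2] == 4 && shrunk[3] == 5;
}
```

The asymmetry in the sketch is why the rename makes sense: the grow path reserves a slot for the caller, while the shrink path completes the deletion itself.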

> Note that there are some practical
> restrictions/best-practices on whether shrinking should happen after
> deletion or vice versa. Hopefully it's clear, but let me know if the
> description can be improved. Also, it doesn't yet shrink from size
> class 32 to 16, but it could with a bit of work.

Sounds reasonable.

>
> 0007 - trivial, but could use a better comment. I also need to make
> sure stats reporting works (may also need some cleanup work).
>
> 0008 - fixes RT_FREE_RECURSE -- I believe you wondered some months ago
> if DSA could just free all our allocated segments without throwing
> away the DSA, and that's still a good question.

LGTM.

>
> 0009 - fixes the assert in RT_ITER_SET_NODE_FROM (btw, I don't think
> this name is better than RT_UPDATE_ITER_STACK, so maybe we should go
> back to that).

Will rename it.

> The assert doesn't fire, so I guess it does what it's
> supposed to?

Yes.

> For me, the iteration logic is still the most confusing
> piece out of the whole radix tree. Maybe that could be helped with
> some better variable names, but I wonder if it needs more invasive
> work.

True. Maybe more comments would also help.

>
> 0010 - some fixes for number of children accounting in node256
>
> 0011 - Long overdue pgindent of radixtree.h, without trying to fix up
> afterwards. Feel free to throw out and redo if this interferes with
> ongoing work.
>

LGTM.

I'm working on the review comments below, and most of them are already
incorporated into my local branch:

> The rest are from your v46. The bench doesn't work for tid store
> anymore, so I squashed "disable bench for CI" until we get back to
> that. Some more review comments (note: patch numbers are for v47, but
> I changed nothing from v46 in this area):
>
> 0013:
>
> + * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value,
> + * and stored in the radix tree.
>
> Recently outdated. The variable-length values seem to work, so let's
> make everything match.
>
> +#define MAX_TUPLES_PER_PAGE MaxOffsetNumber
>
> Maybe we don't need this macro anymore? The name no longer fits, in any case.

Removed.

>
> +TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
> + int num_offsets)
> +{
> + char buf[MaxBlocktableEntrySize];
> + BlocktableEntry *page = (BlocktableEntry *) buf;
>
> I'm not sure this is safe with alignment. Maybe rather than plain
> "char", it needs to be a union with BlocktableEntry, or something.

I tried it in the new patch set, but could you explain why the original
code might not be safe with respect to alignment?
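For reference, a sketch of the union approach (BlocktableEntry and MaxBlocktableEntrySize here are hypothetical stand-ins for the real tidstore definitions):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the real tidstore definitions */
typedef struct BlocktableEntry
{
    uint16_t    nwords;
    uint64_t    words[8];       /* flexible array member in the real code */
} BlocktableEntry;

#define MaxBlocktableEntrySize sizeof(BlocktableEntry)

static int
aligned_entry_demo(void)
{
    /*
     * A bare "char buf[N]" only guarantees alignment 1, so casting it
     * to BlocktableEntry * is formally undefined behavior.  Wrapping
     * the buffer in a union with the struct forces suitable alignment.
     */
    union
    {
        char        buf[MaxBlocktableEntrySize];
        BlocktableEntry entry;
    }           u;
    BlocktableEntry *page = &u.entry;

    memset(page, 0, sizeof(*page));
    page->nwords = 1;
    page->words[0] = UINT64_C(1) << 7;  /* mark offset 7 */
    return ((uintptr_t) (void *) page % _Alignof(BlocktableEntry)) == 0;
}
```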

>
> +static inline BlocktableEntry *
> +tidstore_iter_kv(TidStoreIter *iter, uint64 *key)
> +{
> + if (TidStoreIsShared(iter->ts))
> + return shared_rt_iterate_next(iter->tree_iter.shared, key);
> +
> + return local_rt_iterate_next(iter->tree_iter.local, key);
> +}
>
> In the old encoding scheme, this function did something important, but
> now it's a useless wrapper with one caller.

Removed.

>
> + /*
> + * In the shared case, TidStoreControl and radix_tree are backed by the
> + * same DSA area and rt_memory_usage() returns the value including both.
> + * So we don't need to add the size of TidStoreControl separately.
> + */
> + if (TidStoreIsShared(ts))
> + return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
> +
> + return sizeof(TidStore) + sizeof(TidStore) +
> local_rt_memory_usage(ts->tree.local);
>
> I don't see the point in including these tiny structs, since we will
> always blow past the limit by a number of kilobytes (at least, often
> megabytes or more) at the time it happens.

Agreed, removed.

>
> + iter->output.max_offset = 64;
>
> Maybe needs a comment that this is just some starting size and not
> anything particular.
>
> + iter->output.offsets = palloc(sizeof(OffsetNumber) * iter->output.max_offset);
>
> + /* Make sure there is enough space to add offsets */
> + if (result->num_offsets + bmw_popcount(w) > result->max_offset)
> + {
> + result->max_offset *= 2;
> + result->offsets = repalloc(result->offsets,
> + sizeof(OffsetNumber) * result->max_offset);
> + }
>
> popcount()-ing for every array element in every value is expensive --
> let's just add sizeof(bitmapword). It's not that wasteful, but then
> the initial max will need to be 128.

Good idea.
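The suggested check can be sketched as follows (a simplified sketch; the struct and names are assumptions based on the quoted patch, not the actual code):

```c
#include <stdint.h>
#include <stdlib.h>

typedef uint16_t OffsetNumber;
#define BITS_PER_BITMAPWORD 64

typedef struct IterResult
{
    OffsetNumber *offsets;
    int         num_offsets;
    int         max_offset;     /* starts at 128 under this scheme */
} IterResult;

/*
 * Ensure room for one more word's worth of offsets.  Reserving the
 * worst case (BITS_PER_BITMAPWORD set bits) avoids a popcount per
 * bitmap word, at the cost of a slightly larger initial allocation.
 */
static void
reserve_for_word(IterResult *result)
{
    if (result->num_offsets + BITS_PER_BITMAPWORD > result->max_offset)
    {
        result->max_offset *= 2;
        result->offsets = realloc(result->offsets,
                                  sizeof(OffsetNumber) * result->max_offset);
    }
}

static int
reserve_demo(void)
{
    IterResult  r;

    r.max_offset = 128;
    r.num_offsets = 100;
    r.offsets = malloc(sizeof(OffsetNumber) * r.max_offset);
    reserve_for_word(&r);       /* 100 + 64 > 128, so capacity doubles */
    free(r.offsets);
    return r.max_offset;
}
```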

>
> About separation of responsibilities for locking: The only thing
> currently where the tid store is not locked is tree iteration. That's
> a strange exception. Also, we've recently made RT_FIND return a
> pointer, so the caller must somehow hold a share lock, but I think we
> haven't exposed callers the ability to do that, and we rely on the tid
> store lock for that. We have a mix of tree locking and tid store
> locking. We will need to consider carefully how to make this more
> clear, maintainable, and understandable.

Yes, the tidstore should be locked during iteration.

One simple direction for locking is that the radix tree has a lock, but
no APIs hold or release it; that is the caller's responsibility. If a
data structure using a radix tree for its storage has its own lock
(like tidstore), it can use that instead of the radix tree's lock. A
downside is that it would probably be hard to support a better locking
algorithm such as ROWEX in the radix tree. Another variant of the APIs
that also does the locking/unlocking internally might help.

>
> 0015:
>
> "XXX: some regression test fails since this commit changes the minimum
> m_w_m to 2048 from 1024. This was necessary for the previous memory"
>
> This shouldn't fail anymore if the "one-place" clamp was in a patch
> before this. If so, let's take out that GUC change and worry about
> min/max size separately. If it still fails, I'd like to know why.

Agreed.

>
> - * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the
> - * vacrel->dead_items array.
> + * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in
> the TID store.
>
> What I was getting at earlier is that the first line here doesn't
> really need to change, we can just s/array/store/ ?

Fixed.

>
> -static int
> -lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
> - int index, Buffer vmbuffer)
> +static void
> +lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
> + OffsetNumber *deadoffsets,
> int num_offsets, Buffer buffer,
> + Buffer vmbuffer)
>
> "buffer" should still come after "blkno", so that line doesn't need to change.

Fixed.

>
> $ git diff master -- src/backend/access/heap/ | grep has_lpdead_items
> - bool has_lpdead_items; /* includes existing LP_DEAD items */
> - * pruning and freezing. all_visible implies !has_lpdead_items, but don't
> - Assert(!prunestate.all_visible || !prunestate.has_lpdead_items);
> - if (prunestate.has_lpdead_items)
> - else if (prunestate.has_lpdead_items && PageIsAllVisible(page))
> - if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming)
> - prunestate->has_lpdead_items = false;
> - prunestate->has_lpdead_items = true;
>
> In a green field, it'd be fine to replace these with an expression of
> "num_offsets", but it adds a bit of noise for reviewers and the git
> log. Is it really necessary?

I see your point. I think we can live with having both
has_lpdead_items and num_offsets, but then we would have to keep these
two values consistent, which could hurt maintainability.

>
> - deadoffsets[lpdead_items++] = offnum;
> +
> prunestate->deadoffsets[prunestate->num_offsets++] = offnum;
>
> I'm also not quite sure why "deadoffsets" and "lpdead_items" got
> moved to the PruneState. The latter was renamed in a way that makes
> more sense, but I don't see why the churn is necessary.
>
> @@ -1875,28 +1882,9 @@ lazy_scan_prune(LVRelState *vacrel,
> }
> #endif
>
> - /*
> - * Now save details of the LP_DEAD items from the page in vacrel
> - */
> - if (lpdead_items > 0)
> + if (prunestate->num_offsets > 0)
> {
> - VacDeadItems *dead_items = vacrel->dead_items;
> - ItemPointerData tmp;
> -
> vacrel->lpdead_item_pages++;
> - prunestate->has_lpdead_items = true;
> -
> - ItemPointerSetBlockNumber(&tmp, blkno);
> -
> - for (int i = 0; i < lpdead_items; i++)
> - {
> - ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
> - dead_items->items[dead_items->num_items++] = tmp;
> - }
> -
> - Assert(dead_items->num_items <= dead_items->max_items);
> - pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
> -
> dead_items->num_items);
>
> I don't understand why this block got removed and nothing new is
> adding anything to the tid store.
>
> @@ -1087,7 +1088,16 @@ lazy_scan_heap(LVRelState *vacrel)
> * with prunestate-driven visibility map and
> FSM steps (just like
> * the two-pass strategy).
> */
> - Assert(dead_items->num_items == 0);
> + Assert(TidStoreNumTids(dead_items) == 0);
> + }
> + else if (prunestate.num_offsets > 0)
> + {
> + /* Save details of the LP_DEAD items from the
> page in dead_items */
> + TidStoreSetBlockOffsets(dead_items, blkno,
> prunestate.deadoffsets,
> +
> prunestate.num_offsets);
> +
> +
> pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
> +
> TidStoreMemoryUsage(dead_items));
>
> I guess it was added here, 800 lines away? If so, why?

The above changes are related. The idea is to avoid using the tidstore
in the one-pass strategy. If the table doesn't have any indexes,
lazy_scan_prune() collects the offset numbers of dead tuples on the
page, and we vacuum the page using them directly. In this case we don't
need the tidstore, so we pass the offsets array straight to
lazy_vacuum_heap_page(). The LVPagePruneState is a convenient place to
store the collected offset numbers.
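The routing described above can be condensed into a sketch (the names are illustrative, not the actual vacuumlazy.c code):

```c
/*
 * With no indexes, the collected offsets are applied to the page
 * immediately; with indexes, they go into the tidstore for the later
 * index-vacuum pass.
 */
enum dead_item_action
{
    ACTION_NONE,                /* page had no LP_DEAD items */
    ACTION_VACUUM_PAGE,         /* one-pass: lazy_vacuum_heap_page() now */
    ACTION_STORE_TIDS           /* two-pass: TidStoreSetBlockOffsets() */
};

static enum dead_item_action
route_dead_items(int num_offsets, int table_has_indexes)
{
    if (num_offsets == 0)
        return ACTION_NONE;
    if (!table_has_indexes)
        return ACTION_VACUUM_PAGE;  /* offsets go straight to the page */
    return ACTION_STORE_TIDS;       /* remembered for index vacuuming */
}
```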

>
> About progress reporting: I want to make sure no one is going to miss
> counting "num_dead_tuples". It's no longer relevant for the number of
> index scans we need to do, but do admins still have a use for it?
> Something to think about later.

I'm not sure users will still need num_dead_tuples in the progress
reporting view. The total number of dead tuples might be useful, but
the verbose log already shows it.

>
> 0017
>
> + /*
> + * max_bytes is forced to be at least 64kB, the current minimum valid
> + * value for the work_mem GUC.
> + */
> + max_bytes = Max(64 * 1024L, max_bytes);
>
> If this still needs to be here, I still don't understand why.

Removed.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
