Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <johncnaylorls(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2023-12-15 08:15:09
Message-ID: CAD21AoCfiNE7ze65QwCbhq-96RB8krAUzatFPvniM3JwTXqGdQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Dec 15, 2023 at 10:30 AM John Naylor <johncnaylorls(at)gmail(dot)com> wrote:
>
> On Thu, Dec 14, 2023 at 7:22 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > In v45, 0001 - 0006 are from earlier versions but I've merged previous
> > updates. So the radix tree now has RT_SET() and RT_FIND() but not
> > RT_GET() and RT_SEARCH(). 0007 and 0008 are the updates from previous
> > versions that incorporated the above comments. 0009 patch integrates
> > tidstore with lazy vacuum.
>
> Excellent! I repeated a quick run of the small "test 1" with very low m_w_m from
>
> https://www.postgresql.org/message-id/CAFBsxsHrvTPUK%3DC1%3DxweJjGujja4Xjfgva3C8jnW3Shz6RBnFg%40mail.gmail.com
>
> ...and got similar results, so we still have good space-efficiency on this test:
>
> master:
> INFO: finished vacuuming "john.public.test": index scans: 9
> system usage: CPU: user: 56.83 s, system: 9.36 s, elapsed: 119.62 s
>
> v45:
> INFO: finished vacuuming "john.public.test": index scans: 1
> system usage: CPU: user: 6.82 s, system: 2.05 s, elapsed: 10.89 s

Thank you for testing it again. That's a very good result.

> For my next steps, I will finish the node-shrinking behavior and save
> for a later patchset. Not needed for tid store, but needs to happen
> because of assumptions in the code. Also, some time ago, I think I
> commented out RT_FREE_RECURSE to get something working, so I'll fix
> it, and look at other fixmes and todos.

Great!

>
> > Note that DSA segment problem is not
> > resolved yet in this patch.
>
> I remember you started a separate thread about this, but I don't think
> it got any attention. Maybe reply with a "TLDR;" and share a patch to
> allow controlling max segment size.

Yeah, I recalled that thread. Will send a reply.

>
> Some more comments:
>
> v45-0003:
>
> Since RT_ITERATE_NEXT_PTR works for tid store, do we even need
> RT_ITERATE_NEXT anymore? The former should handle fixed-length values
> just fine? If so, we should rename it to match the latter.

Agreed to rename it.

>
> + * The caller is responsible for locking/unlocking the tree in shared mode.
>
> This is not new to v45, but this will come up again below. This needs
> more explanation: Since we're returning a pointer (to support
> variable-length values), the caller needs to maintain control until
> it's finished with the value.

Will fix.

>
> v45-0005:
>
> + * Regarding the concurrency support, we use a single LWLock for the TidStore.
> + * The TidStore is exclusively locked when inserting encoded tids to the
> + * radix tree or when resetting itself. When searching on the TidStore or
> + * doing the iteration, it is not locked but the underlying radix tree is
> + * locked in shared mode.
>
> This is just stating facts without giving any reasons. Readers are
> going to wonder why it's inconsistent. The "why" is much more
> important than the "what". Even with that, this comment is also far
> from the relevant parts, and so will get out of date. Maybe we can
> just make sure each relevant function is explained individually.

Right, I'll fix it.

>
> v45-0007:
>
> -RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
> +RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, Size work_mem);
>
> Tid store calls this max_bytes -- can we use that name here, too?
> "work_mem" is highly specific.

While I agree that "work_mem" is highly specific, I avoided using
"max_bytes" in radix tree because "max_bytes" sounds to me there is a
memory limitation but the radix tree doesn't have it actually. It
might be sufficient to mention it in the comment, though.

>
> - RT_PTR_ALLOC *slot;
> + RT_PTR_ALLOC *slot = NULL;
>
> We have a macro for invalid pointer because of DSA.

Will fix.

>
> v45-0008:
>
> - if (off < 1 || off > MAX_TUPLES_PER_PAGE)
> + if (unlikely(off < 1 || off > MAX_TUPLES_PER_PAGE))
> elog(ERROR, "tuple offset out of range: %u", off);
>
> This is a superfluous distraction, since the error path is located way
> off in the cold segment of the binary.

Okay, will remove it.

>
> v45-0009:
>
> (just a few small things for now)
>
> - * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the
> - * vacrel->dead_items array.
> + * lazy_vacuum_heap_page() -- free page's LP_DEAD items.
>
> I think we can keep as "listed in the TID store".
>
> - * Allocate dead_items (either using palloc, or in dynamic shared memory).
> - * Sets dead_items in vacrel for caller.
> + * Allocate a (local or shared) TidStore for storing dead TIDs. Sets dead_items
> + * in vacrel for caller.
>
> I think we want to keep "in dynamic shared memory". It's still true.
> I'm not sure anything needs to change here, actually.

Agreed with above comments. Will fix them.

>
> parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
> - int nrequested_workers, int max_items,
> - int elevel, BufferAccessStrategy bstrategy)
> + int nrequested_workers, int vac_work_mem,
> + int max_offset, int elevel,
> + BufferAccessStrategy bstrategy)
>
> It seems very strange to me that this function has to pass the
> max_offset. In general, it's been simpler to assume we have a constant
> max_offset, but in this case that fact is not helping. Something to
> think about for later.

max_offset was previously used in old TID encoding in tidstore. Since
tidstore has entries for each block, I think we no longer need it.

>
> - (errmsg("scanned index \"%s\" to remove %d row versions",
> + (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
>
> This should be signed int64.

Will fix.

>
> v45-0010:
>
> Thinking about this some more, I'm not sure we need to do anything
> different for the *starting* segment size. (Controlling *max* size
> does seem important, however.) For the corner case of m_w_m = 1MB,
> it's fine if vacuum quits pruning immediately after (in effect) it
> finds the DSA has gone to 2MB. It's not worth bothering with, IMO. If
> the memory accounting starts >1MB because we're adding the trivial
> size of some struct, let's just stop doing that. The segment
> allocations are what we care about.

IIUC it's for work_mem, whose the minimum value is 64kB.

>
> v45-0011:
>
> + /*
> + * max_bytes is forced to be at least 64kB, the current minimum valid
> + * value for the work_mem GUC.
> + */
> + max_bytes = Max(64 * 1024L, max_bytes);
>
> Why?

This is to avoid creating a radix tree within very small memory. The
minimum work_mem value is a reasonable lower bound that PostgreSQL
uses internally. It's actually copied from tuplesort.c.

>I believe I mentioned months ago that copying a hard-coded value
> that can get out of sync is not maintainable, but I don't even see the
> point of this part.

True.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrei Lepikhov 2023-12-15 08:36:08 Re: Statistics Import and Export
Previous Message Andy Fan 2023-12-15 07:51:10 Re: Is a clearer memory lifespan for outerTuple and innerTuple useful?