Re: [PoC] Improve dead tuple storage for lazy vacuum

From: John Naylor <johncnaylorls(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2023-12-15 01:30:18
Message-ID: CANWCAZZYSeB3i+UeSBjihcT1FuaQLiCkQCjGHp8hvPpXp9mm5w@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 14, 2023 at 7:22 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> In v45, 0001 - 0006 are from earlier versions but I've merged previous
> updates. So the radix tree now has RT_SET() and RT_FIND() but not
> RT_GET() and RT_SEARCH(). 0007 and 0008 are the updates from previous
> versions that incorporated the above comments. 0009 patch integrates
> tidstore with lazy vacuum.

Excellent! I repeated a quick run of the small "test 1" with very low m_w_m from

https://www.postgresql.org/message-id/CAFBsxsHrvTPUK%3DC1%3DxweJjGujja4Xjfgva3C8jnW3Shz6RBnFg%40mail.gmail.com

...and got similar results, so we still have good space-efficiency on this test:

master:
INFO: finished vacuuming "john.public.test": index scans: 9
system usage: CPU: user: 56.83 s, system: 9.36 s, elapsed: 119.62 s

v45:
INFO: finished vacuuming "john.public.test": index scans: 1
system usage: CPU: user: 6.82 s, system: 2.05 s, elapsed: 10.89 s

More sparse TID distributions won't be as favorable, but we have ideas
to improve that in the future.

For my next steps, I will finish the node-shrinking behavior and save it
for a later patchset. It's not needed for tid store, but it needs to
happen because of assumptions in the code. Also, some time ago I think I
commented out RT_FREE_RECURSE to get something working, so I'll fix
that, and look at other fixmes and todos.

> Note that DSA segment problem is not
> resolved yet in this patch.

I remember you started a separate thread about this, but I don't think
it got any attention. Maybe reply with a "TLDR;" and share a patch to
allow controlling max segment size.

Some more comments:

v45-0003:

Since RT_ITERATE_NEXT_PTR works for tid store, do we even need
RT_ITERATE_NEXT anymore? The former should handle fixed-length values
just fine? If so, we should rename it to match the latter.
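
To be concrete, a caller with fixed-length values can just dereference
the returned pointer, something like this (a rough sketch, assuming the
begin/end iterator names from the patch and BlockNumber as an example
value type; not checked against v45):

RT_ITER    *iter = RT_BEGIN_ITERATE(tree);
uint64      key;
BlockNumber *value;

/* the pointer-returning variant handles fixed-length values the same way */
while ((value = RT_ITERATE_NEXT_PTR(iter, &key)) != NULL)
    do_something(key, *value);

RT_END_ITERATE(iter);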

+ * The caller is responsible for locking/unlocking the tree in shared mode.

This is not new to v45, but it will come up again below. It needs
more explanation: since we're returning a pointer (to support
variable-length values), the caller needs to keep holding the lock
until it's finished with the value.
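
In other words, the usage pattern we're implicitly requiring is roughly
this (a sketch only; the lock helper names are placeholders for whatever
the template actually exposes):

RT_VALUE_TYPE *value;

RT_LOCK_SHARE(tree);

value = RT_FIND(tree, key);     /* pointer into the tree */
if (value != NULL)
    use_value(value);           /* must happen while still locked */

RT_UNLOCK(tree);                /* only now may the pointer be invalidated */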

v45-0005:

+ * Regarding the concurrency support, we use a single LWLock for the TidStore.
+ * The TidStore is exclusively locked when inserting encoded tids to the
+ * radix tree or when resetting itself. When searching on the TidStore or
+ * doing the iteration, it is not locked but the underlying radix tree is
+ * locked in shared mode.

This is just stating facts without giving any reasons. Readers are
going to wonder why it's inconsistent. The "why" is much more
important than the "what". Even with that, this comment is also far
from the relevant parts, and so will get out of date. Maybe we can
just make sure each relevant function is explained individually.
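
For example, on the lookup path something like this would put the "why"
where readers need it (the wording and the stated reason are only a
sketch of the kind of comment I mean):

/*
 * We don't take the TidStore's lock here: lookups only need a consistent
 * view of the radix tree, which the tree's own shared lock already
 * provides.  The TidStore lock is only needed where we also update the
 * TidStore's own bookkeeping, i.e. insert and reset.
 */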

v45-0007:

-RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, Size work_mem);

Tid store calls this max_bytes -- can we use that name here, too?
"work_mem" is highly specific.

- RT_PTR_ALLOC *slot;
+ RT_PTR_ALLOC *slot = NULL;

We have a macro for invalid pointer because of DSA.
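
i.e. something along these lines (from memory, so the exact spellings
may differ in v45):

#ifdef RT_SHMEM
#define RT_PTR_ALLOC            dsa_pointer
#define RT_INVALID_PTR_ALLOC    InvalidDsaPointer
#else
#define RT_PTR_ALLOC            RT_NODE *
#define RT_INVALID_PTR_ALLOC    NULL
#endif

In the shared case the "pointer" is really a dsa_pointer, so the macro
is the right spelling for "invalid", not a literal NULL.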

v45-0008:

- if (off < 1 || off > MAX_TUPLES_PER_PAGE)
+ if (unlikely(off < 1 || off > MAX_TUPLES_PER_PAGE))
elog(ERROR, "tuple offset out of range: %u", off);

This is a superfluous distraction, since the error path is located way
off in the cold segment of the binary.

v45-0009:

(just a few small things for now)

- * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the
- * vacrel->dead_items array.
+ * lazy_vacuum_heap_page() -- free page's LP_DEAD items.

I think we can keep this as "listed in the TID store".

- * Allocate dead_items (either using palloc, or in dynamic shared memory).
- * Sets dead_items in vacrel for caller.
+ * Allocate a (local or shared) TidStore for storing dead TIDs. Sets dead_items
+ * in vacrel for caller.

I think we want to keep "in dynamic shared memory". It's still true.
I'm not sure anything needs to change here, actually.

parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int vac_work_mem,
+ int max_offset, int elevel,
+ BufferAccessStrategy bstrategy)

It seems very strange to me that this function has to be passed
max_offset. In general, it's been simpler to assume we have a constant
max_offset, but in this case that assumption is not helping. Something
to think about for later.

- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",

This should be signed int64.
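
i.e. roughly this, using the signed format macro (the count variable
here is just a placeholder):

ereport(elevel,
        (errmsg("scanned index \"%s\" to remove " INT64_FORMAT " row versions",
                vacrel->indname, (int64) num_dead_tuples)));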

v45-0010:

Thinking about this some more, I'm not sure we need to do anything
different for the *starting* segment size. (Controlling *max* size
does seem important, however.) For the corner case of m_w_m = 1MB,
it's fine if vacuum quits pruning immediately after (in effect) it
finds the DSA has gone to 2MB. It's not worth bothering with, IMO. If
the memory accounting starts >1MB because we're adding the trivial
size of some struct, let's just stop doing that. The segment
allocations are what we care about.
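
To spell out "the segment allocations are what we care about": for the
shared case I'd expect the reported usage to be just what DSA has
actually allocated, along these lines (a sketch only;
dsa_get_total_size() and MemoryContextMemAllocated() are existing APIs,
the TidStore field names are placeholders):

size_t
tidstore_memory_usage_sketch(TidStore *ts)
{
    if (ts->area != NULL)
        return dsa_get_total_size(ts->area);    /* shared: segments only */
    else
        return MemoryContextMemAllocated(ts->context, true);
}

That way a 1MB m_w_m just means we stop pruning once the area has grown
to its first couple of segments, with no struct sizes muddying the
accounting.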

v45-0011:

+ /*
+ * max_bytes is forced to be at least 64kB, the current minimum valid
+ * value for the work_mem GUC.
+ */
+ max_bytes = Max(64 * 1024L, max_bytes);

Why? I believe I mentioned months ago that copying a hard-coded value
that can get out of sync is not maintainable, but I don't even see the
point of this part.
