Re: [PoC] Improve dead tuple storage for lazy vacuum

From: John Naylor <johncnaylorls(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2024-03-25 06:25:17
Message-ID: CANWCAZbycz6MRWUmYA8GhubLNqVoby779=_iL28mRxKApMazgw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Mar 22, 2024 at 12:20 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Thu, Mar 21, 2024 at 7:48 PM John Naylor <johncnaylorls(at)gmail(dot)com> wrote:

> > v77-0001
> >
> > - dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
> > - dead_items->max_items = max_items;
> > - dead_items->num_items = 0;
> > + vacrel->dead_items = TidStoreCreate(vac_work_mem, NULL, 0);
> > +
> > + dead_items_info = (VacDeadItemsInfo *) palloc(sizeof(VacDeadItemsInfo));
> > + dead_items_info->max_bytes = vac_work_mem * 1024L;
> >
> > This is confusing enough that it looks like a bug:
> >
> > [inside TidStoreCreate()]
> > /* choose the maxBlockSize to be no larger than 1/16 of max_bytes */
> > while (16 * maxBlockSize > max_bytes * 1024L)
> > maxBlockSize >>= 1;
> >
> > This was copied from CreateWorkExprContext, which operates directly on
> > work_mem -- if the parameter is actually bytes, we can't "* 1024"
> > here. If we're passing something measured in kilobytes, the parameter
> > is badly named. Let's use convert once and use bytes everywhere.
>
> True. The attached 0001 patch fixes it.

v78-0001 and 02 are fine, but for 0003 there is a consequence that I
didn't see mentioned: vac_work_mem now refers to bytes, where before
it referred to kilobytes. It seems pretty confusing to use a different
convention from elsewhere, especially if it has the same name but
different meaning across versions. Worse, this change is buried inside
a moving-stuff-around diff, making it hard to see. Maybe "convert only
once" is still possible, but I was actually thinking of

+ dead_items_info->max_bytes = vac_work_mem * 1024L;
+ vacrel->dead_items = TidStoreCreate(dead_items_info->max_bytes, NULL, 0);

That way it's pretty obvious that it's correct. That may require a bit
of duplication and moving around for shmem, but there is some of that
already.

More on 0003:

- * The major space usage for vacuuming is storage for the array of dead TIDs
+ * The major space usage for vacuuming is TidStore, a storage for dead TIDs

+ * autovacuum_work_mem) memory space to keep track of dead TIDs. If the
+ * TidStore is full, we must call lazy_vacuum to vacuum indexes (and to vacuum

I wonder if the comments here should refer to it using a more natural
spelling, like "TID store".

- * items in the dead_items array for later vacuuming, count live and
+ * items in the dead_items for later vacuuming, count live and

Maybe "the dead_items area", or "the dead_items store" or "in dead_items"?

- * remaining LP_DEAD line pointers on the page in the dead_items
- * array. These dead items include those pruned by lazy_scan_prune()
- * as well we line pointers previously marked LP_DEAD.
+ * remaining LP_DEAD line pointers on the page in the dead_items.
+ * These dead items include those pruned by lazy_scan_prune() as well
+ * we line pointers previously marked LP_DEAD.

Here maybe "into dead_items".

Also, "we line pointers" seems to be a pre-existing typo.

- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " INT64_FORMAT "dead item identifiers
in %u pages",
+ vacrel->relname, vacrel->dead_items_info->num_items, vacuumed_pages)));

This is a translated message, so let's keep the message the same.

/*
* Allocate dead_items (either using palloc, or in dynamic shared memory).
* Sets dead_items in vacrel for caller.
*
* Also handles parallel initialization as part of allocating dead_items in
* DSM when required.
*/
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)

This comment didn't change at all. It's not wrong, but let's consider
updating the specifics.

v78-0004:

> > +#define dsa_create(tranch_id) \
> > + dsa_create_ext(tranch_id, DSA_INITIAL_SEGMENT_SIZE, DSA_MAX_SEGMENT_SIZE)
> >
> > Since these macros are now referring to defaults, maybe their name
> > should reflect that. Something like DSA_DEFAULT_INIT_SEGMENT_SIZE
> > (*_MAX_*)
>
> It makes sense to rename DSA_INITIAL_SEGMENT_SIZE , but I think that
> the DSA_MAX_SEGMENT_SIZE is the theoretical maximum size, the current
> name also makes sense to me.

Right, that makes sense.

v78-0005:

"Although commit XXX
allowed specifying the initial and maximum DSA segment sizes, callers
still needed to clamp their own limits, which was not consistent and
user-friendly."

Perhaps s/still needed/would have needed/ ..., since we're preventing
that necessity.

> > Did you try it with 1MB m_w_m?
>
> I've incorporated the above comments and test results look good to me.

Could you be more specific about what the test was?
Does it work with 1MB m_w_m?

+ /*
+ * Choose the initial and maximum DSA segment sizes to be no longer
+ * than 1/16 and 1/8 of max_bytes, respectively. If the initial
+ * segment size is low, we end up having many segments, which risks
+ * exceeding the total number of segments the platform can have.

The second sentence is technically correct, but I'm not sure how it
relates to the code that follows.

+ while (16 * dsa_init_size > max_bytes)
+ dsa_init_size >>= 1;
+ while (8 * dsa_max_size > max_bytes)
+ dsa_max_size >>= 1;

I'm not sure we need a separate loop for "dsa_init_size". Can we just have :

while (8 * dsa_max_size > max_bytes)
dsa_max_size >>= 1;

if (dsa_max_size < DSA_MIN_SEGMENT_SIZE)
dsa_max_size = DSA_MIN_SEGMENT_SIZE;

if (dsa_init_size > dsa_max_size)
dsa_init_size = dsa_max_size;

@@ -113,13 +113,10 @@ static void
tidstore_iter_extract_tids(TidStoreIter *iter, BlockNumber blkno,
* CurrentMemoryContext at the time of this call. The TID storage, backed
* by a radix tree, will live in its child memory context, rt_context. The
* TidStore will be limited to (approximately) max_bytes total memory
- * consumption. If the 'area' is non-NULL, the radix tree is created in the
- * DSA area.
- *
- * The returned object is allocated in backend-local memory.
+ * consumption.

The existing comment slipped past my radar, but max_bytes is not a
limit, it's a hint. Come to think of it, it never was a limit in the
normal sense, but in earlier patches it was the criteria for reporting
"I'm full" when asked.

void
TidStoreDestroy(TidStore *ts)
{
- /* Destroy underlying radix tree */
if (TidStoreIsShared(ts))
+ {
+ /* Destroy underlying radix tree */
shared_ts_free(ts->tree.shared);
+
+ dsa_detach(ts->area);
+ }
else
local_ts_free(ts->tree.local);

It's still destroyed in the local case, so not sure why this comment was moved?

v78-0006:

-#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
+/* 2 was PARALLEL_VACUUM_KEY_DEAD_ITEMS */

I don't see any use in core outside this module -- maybe it's possible
to renumber these?

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Donghang Lin 2024-03-25 06:28:52 Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE
Previous Message shveta malik 2024-03-25 06:23:53 Re: Introduce XID age and inactive timeout based replication slot invalidation