Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2023-02-20 05:56:15
Message-ID: CAD21AoAFiw2DS07nPhiS14eVXpnF-bBmjvab4y+Ct8igU2z_8A@mail.gmail.com
Lists: pgsql-hackers

On Thu, Feb 16, 2023 at 6:23 PM John Naylor
<john(dot)naylor(at)enterprisedb(dot)com> wrote:
>
> On Thu, Feb 16, 2023 at 10:24 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Tue, Feb 14, 2023 at 8:24 PM John Naylor
> > <john(dot)naylor(at)enterprisedb(dot)com> wrote:
>
> > > > I can think that something like traversing a HOT chain could visit
> > > > offsets out of order. But fortunately we prune such collected TIDs
> > > > before heap vacuum in heap case.
> > >
> > > Further, currently we *already* assume we populate the tid array in order (for binary search), so we can just continue assuming that (with an assert added since it's more public in this form). I'm not sure why such basic common sense evaded me a few versions ago...
> >
> > Right. TidStore is implemented not only for heap, so loading
> > out-of-order TIDs might be important in the future.
>
> That's what I was probably thinking about some weeks ago, but I'm having a hard time imagining how it would come up, even for something like the conveyor-belt concept.
>
> > We have the following WIP comment in test_radixtree:
> >
> > // WIP: compiles with warnings because rt_attach is defined but not used
> > // #define RT_SHMEM
> >
> > How about unsetting RT_SCOPE to suppress warnings for unused rt_attach
> > and friends?
>
> Sounds good to me, and the other fixes make sense as well.

Thanks, I merged them.

>
> > FYI I've briefly tested the TidStore with blocksize = 32kb, and it
> > seems to work fine.
>
> That was on my list, so great! How about the other end -- nominally we allow 512b. (In practice it won't matter, but this would make sure I didn't mess anything up when forcing all MaxTuplesPerPage to encode.)

According to the doc, the minimum block size is 1kB. It seems to work
fine with 1kB blocks.
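(For reference, that test was with a build configured along these
lines; --with-blocksize takes the size in kB, and the same approach
should work for the 32kB case:

./configure --with-blocksize=1
)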

>
> > You removed the vacuum integration patch from v27, is there any reason for that?
>
> Just an oversight.
>
> Now for some general comments on the tid store...
>
> + * TODO: The caller must be certain that no other backend will attempt to
> + * access the TidStore before calling this function. Other backend must
> + * explicitly call tidstore_detach to free up backend-local memory associated
> + * with the TidStore. The backend that calls tidstore_destroy must not call
> + * tidstore_detach.
> + */
> +void
> +tidstore_destroy(TidStore *ts)
>
> Do we need to do anything for this todo?

Since it's not a problem in practice, I think we can live with it for
now. dshash has the same TODO.

>
> It might help readability to have a concept of "off_upper/off_lower", just so we can describe things more clearly. The key is block + off_upper, and the value is a bitmap of all the off_lower bits. I hinted at that in my addition of encode_key_off(). Along those lines, maybe s/TIDSTORE_OFFSET_MASK/TIDSTORE_OFFSET_LOWER_MASK/. Actually, I'm not even sure the TIDSTORE_ prefix is valuable for these local macros.
>
> The word "value" as a variable name is pretty generic in this context, and it might be better to call it the off_lower_bitmap, at least in some places. The "key" doesn't have a good short term for naming, but in comments we should make sure we're clear it's "block# + off_upper".
>
> I'm not a fan of the name "tid_i", even as a temp variable -- maybe "compressed_tid"?
>
> maybe s/tid_to_key_off/encode_tid/ and s/encode_key_off/encode_block_offset/
>
> It might be worth using typedefs for key and value type. Actually, since key type is fixed for the foreseeable future, maybe the radix tree template should define a key typedef?
>
> The term "result" is probably fine within the tidstore, but as a public name used by vacuum, it's not very descriptive. I don't have a good idea, though.
>
> Some files in backend/access use CamelCase for public functions, although it's not consistent. I think doing that for tidstore would help readability, since they would stand out from rt_* functions and vacuum functions. It's a matter of taste, though.
>
> I don't understand the control flow in tidstore_iterate_next(), or when BlockNumberIsValid() is true. If this is the best way to code this, it needs more commentary.

The attached 0008 patch addressed all of the above comments on the
tidstore.
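To be sure we mean the same thing by off_upper/off_lower, here is a
minimal sketch of the encoding in those terms (the bit widths, macro
names, and function name below are illustrative, not the patch's
actual definitions):

/*
 * Illustrative only: the key is block# + off_upper, and the value is a
 * bitmap of off_lower bits, so all TIDs sharing block# + off_upper are
 * stored in a single entry.
 */
#define OFF_LOWER_BITS	6	/* 64 bits per bitmap value */
#define OFF_LOWER_MASK	((1 << OFF_LOWER_BITS) - 1)
#define OFFSET_BITS		11	/* illustrative width of the offset part */

typedef uint64 offsetbm;	/* bitmap of off_lower bits */

static inline uint64
encode_tid(BlockNumber block, OffsetNumber offset, offsetbm *off_lower_bit)
{
	uint64		compressed_tid = ((uint64) block << OFFSET_BITS) | offset;

	*off_lower_bit = UINT64CONST(1) << (compressed_tid & OFF_LOWER_MASK);
	return compressed_tid >> OFF_LOWER_BITS;	/* key: block# + off_upper */
}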

> Some comments on vacuum:
>
> I think we'd better get some real-world testing of this, fairly soon.
>
> I had an idea: If it's not too much effort, it might be worth splitting it into two parts: one that just adds the store (not caring about its memory limits or progress reporting etc). During index scan, check both the new store and the array and log a warning (we don't want to exit or crash, better to try to investigate while live if possible) if the result doesn't match. Then perhaps set up an instance and let something like TPC-C run for a few days. The second patch would just restore the rest of the current patch. That would help reassure us it's working as designed.

Yeah, I did a similar thing in an earlier version of the tidstore
patch. Since we're introducing two new components, the radix tree and
the tidstore, I sometimes find it hard to tell whether a failure
during lazy (parallel) vacuum comes from a bug in the tidstore or in
the radix tree. And if there is a bug in lazy vacuum itself, we cannot
even run initdb. So it might be a good idea to do such checks in
USE_ASSERT_CHECKING builds (or under a separate macro, say
DEBUG_TIDSTORE). For example, TidStore would store TIDs in both the
radix tree and an array, and verify that the two agree on every lookup
and iteration. That uses more memory, but it would not be a big
problem in USE_ASSERT_CHECKING builds. It would also be great if we
could enable such checks on some buildfarm animals.
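Concretely, something like the following is what I have in mind (just
a sketch; DEBUG_TIDSTORE and the debug_* names are hypothetical):

#ifdef DEBUG_TIDSTORE
/* Hypothetical: a plain TID array maintained alongside the radix tree */
static void
tidstore_debug_add_tids(TidStore *ts, BlockNumber block,
						OffsetNumber *offsets, int num_offsets)
{
	for (int i = 0; i < num_offsets; i++)
	{
		ItemPointerSet(&ts->debug_tids[ts->debug_num_tids],
					   block, offsets[i]);
		ts->debug_num_tids++;
	}
}

/* Cross-check a radix tree lookup result against the array */
static void
tidstore_debug_check_lookup(TidStore *ts, ItemPointer tid, bool tree_found)
{
	bool		array_found = false;

	/* linear scan for illustration; the real check could binary search */
	for (int64 i = 0; i < ts->debug_num_tids; i++)
	{
		if (ItemPointerEquals(&ts->debug_tids[i], tid))
		{
			array_found = true;
			break;
		}
	}
	Assert(array_found == tree_found);
}
#endif							/* DEBUG_TIDSTORE */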

> Soon I plan to do some measurements with vacuuming large tables to get some concrete numbers that the community can get excited about.

Thanks!

>
> We also want to verify that progress reporting works as designed and has no weird corner cases.
>
> * autovacuum_work_mem) memory space to keep track of dead TIDs. We initially
> ...
> + * create a TidStore with the maximum bytes that can be used by the TidStore.
>
> This kind of implies that we allocate the maximum bytes upfront. I think this sentence can be removed. We already mentioned in the previous paragraph that we set an upper bound.

Agreed.

>
> - (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
> - vacrel->relname, (long long) index, vacuumed_pages)));
> + (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
> + vacrel->relname, tidstore_num_tids(vacrel->dead_items),
> + vacuumed_pages)));
>
> I don't think the format string has to change, since num_tids was changed back to int64 in an earlier patch version?

Since tidstore_num_tids() returns int64 and the (long long) cast is
gone, I think the format does need to change, but to INT64_FORMAT.
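That is, something like this (note that the hunk above is also missing
a space before "dead"):

(errmsg("table \"%s\": removed " INT64_FORMAT " dead item identifiers in %u pages",
		vacrel->relname, tidstore_num_tids(vacrel->dead_items),
		vacuumed_pages)));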

>
> - * the memory space for storing dead items allocated in the DSM segment. We
> [a lot of whitespace adjustment]
> + * the shared TidStore. We launch parallel worker processes at the start of
>
> The old comment still seems mostly ok? Maybe just s/DSM segment/DSA area/ or something else minor.
>
> - /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
> - est_dead_items_len = vac_max_items_to_alloc_size(max_items);
> - shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
> + /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
> + shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
>
> If we're starting from the minimum, "estimate" doesn't really describe it anymore? Maybe "Initial size"?
> What does dsa_minimum_size() work out to in practice? 1MB?
> Also, I think PARALLEL_VACUUM_KEY_DSA is left over from an earlier patch.
>

Right. The attached 0009 patch addressed the comments on vacuum
integration, except for the correctness checking.

> Lastly, on the radix tree:
>
> I find extend, set, and set_extend hard to keep straight when studying the code. Maybe EXTEND -> EXTEND_UP , SET_EXTEND -> EXTEND_DOWN ?
>
> RT_ITER_UPDATE_KEY is unused, but I somehow didn't notice when turning it into a template.

It was used in radixtree_iter_impl.h, but I removed it since it was
not necessary.

>
> + /*
> + * Set the node to the node iterator and update the iterator stack
> + * from this node.
> + */
> + RT_UPDATE_ITER_STACK(iter, child, level - 1);
>
> +/*
> + * Update each node_iter for inner nodes in the iterator node stack.
> + */
> +static void
> +RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
>
> These comments don't really help readers unfamiliar with the code. The iteration coding in general needs clearer description.
>

I agree with all of the above comments. The attached 0007 patch
addressed the comments on the radix tree.

> In the test:
>
> + 4, /* RT_NODE_KIND_4 */
>
> The small size was changed to 3 -- if this test needs to know the max size for each kind (class?), I wonder why it didn't fail. Should it? Maybe we need symbols for the various fanouts.
>

Since this information is used only to determine the number of keys to
insert, it doesn't actually check the node kind; we simply never
tested node-3. It might be better to expose and use both RT_SIZE_CLASS
and RT_SIZE_CLASS_INFO.

> I also want to mention now that we better decide soon if we want to support shrinking of nodes for v16, even if the tidstore never shrinks. We'll need to do it at some point, but I'm not sure if doing it now would make more work for future changes targeting highly concurrent workloads. If so, doing it now would just be wasted work. On the other hand, someone might have a use that needs deletion before someone else needs concurrency. Just in case, I have a start of node-shrinking logic, but needs some work because we need the (local pointer) parent to update to the new smaller node, just like the growing case.

Thanks, that's also on my todo list. TBH I'm not sure we should
improve deletion at this stage, as there is no use case for deletion
in core. I'd prefer to focus on improving the quality of the current
radix tree and tidstore now; I think we can add node-shrinking once we
are confident in the current implementation.

On Fri, Feb 17, 2023 at 5:00 PM John Naylor
<john(dot)naylor(at)enterprisedb(dot)com> wrote:
>
> That sounds slow, so it might still be good for vacuum to call a function that passes a block and an array of offsets that are assumed ordered (as in v28), but with a more accurate name, like tidstore_set_block_offsets().

tidstore_set_block_offsets() sounds better. I used
TidStoreSetBlockOffsets() in the latest patch set.
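With that, the call site in vacuum becomes a single call per heap
page, roughly like this (variable names are illustrative):

/* record this page's dead item offsets, collected in ascending order */
if (num_dead_offsets > 0)
	TidStoreSetBlockOffsets(vacrel->dead_items, blkno,
							dead_offsets, num_dead_offsets);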

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v29-0006-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch application/octet-stream 48.1 KB
v29-0007-Review-radix-tree.patch application/octet-stream 22.5 KB
v29-0010-Revert-building-benchmark-module-for-CI.patch application/octet-stream 694 bytes
v29-0009-Review-vacuum-integration.patch application/octet-stream 12.6 KB
v29-0008-Review-TidStore.patch application/octet-stream 32.1 KB
v29-0005-Tool-for-measuring-radix-tree-and-tidstore-perfo.patch application/octet-stream 24.7 KB
v29-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch application/octet-stream 6.1 KB
v29-0001-Introduce-helper-SIMD-functions-for-small-byte-a.patch application/octet-stream 2.9 KB
v29-0003-Add-radixtree-template.patch application/octet-stream 117.0 KB
v29-0004-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch application/octet-stream 36.0 KB
