Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2023-02-22 06:15:30
Message-ID: CAD21AoDCuTO6_wKF9mPm2hb9Fo-dc9kwbec6iXpwptsYaoFdyA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Feb 20, 2023 at 2:56 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Thu, Feb 16, 2023 at 6:23 PM John Naylor
> <john(dot)naylor(at)enterprisedb(dot)com> wrote:
> >
> > On Thu, Feb 16, 2023 at 10:24 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > >
> > > On Tue, Feb 14, 2023 at 8:24 PM John Naylor
> > > <john(dot)naylor(at)enterprisedb(dot)com> wrote:
> >
> > > > > I can think that something like traversing a HOT chain could visit
> > > > > offsets out of order. But fortunately we prune such collected TIDs
> > > > > before heap vacuum in heap case.
> > > >
> > > > Further, currently we *already* assume we populate the tid array in order (for binary search), so we can just continue assuming that (with an assert added since it's more public in this form). I'm not sure why such basic common sense evaded me a few versions ago...
> > >
> > > Right. TidStore is implemented not only for heap, so loading
> > > out-of-order TIDs might be important in the future.
> >
> > That's what I was probably thinking about some weeks ago, but I'm having a hard time imagining how it would come up, even for something like the conveyor-belt concept.
> >
> > > We have the following WIP comment in test_radixtree:
> > >
> > > // WIP: compiles with warnings because rt_attach is defined but not used
> > > // #define RT_SHMEM
> > >
> > > How about unsetting RT_SCOPE to suppress warnings for unused rt_attach
> > > and friends?
> >
> > Sounds good to me, and the other fixes make sense as well.
>
> Thanks, I merged them.
>
> >
> > > FYI I've briefly tested the TidStore with blocksize = 32kb, and it
> > > seems to work fine.
> >
> > That was on my list, so great! How about the other end -- nominally we allow 512b. (In practice it won't matter, but this would make sure I didn't mess anything up when forcing all MaxTuplesPerPage to encode.)
>
> According to the doc, the minimum block size is 1kB. It seems to work
> fine with 1kB blocks.
>
> >
> > > You removed the vacuum integration patch from v27, is there any reason for that?
> >
> > Just an oversight.
> >
> > Now for some general comments on the tid store...
> >
> > + * TODO: The caller must be certain that no other backend will attempt to
> > + * access the TidStore before calling this function. Other backend must
> > + * explicitly call tidstore_detach to free up backend-local memory associated
> > + * with the TidStore. The backend that calls tidstore_destroy must not call
> > + * tidstore_detach.
> > + */
> > +void
> > +tidstore_destroy(TidStore *ts)
> >
> > Do we need to do anything for this todo?
>
> Since it's practically no problem, I think we can live with it for
> now. dshash also has the same todo.
>
> >
> > It might help readability to have a concept of "off_upper/off_lower", just so we can describe things more clearly. The key is block + off_upper, and the value is a bitmap of all the off_lower bits. I hinted at that in my addition of encode_key_off(). Along those lines, maybe s/TIDSTORE_OFFSET_MASK/TIDSTORE_OFFSET_LOWER_MASK/. Actually, I'm not even sure the TIDSTORE_ prefix is valuable for these local macros.
> >
> > The word "value" as a variable name is pretty generic in this context, and it might be better to call it the off_lower_bitmap, at least in some places. The "key" doesn't have a good short term for naming, but in comments we should make sure we're clear it's "block# + off_upper".
> >
> > I'm not a fan of the name "tid_i", even as a temp variable -- maybe "compressed_tid"?
> >
> > maybe s/tid_to_key_off/encode_tid/ and s/encode_key_off/encode_block_offset/
> >
> > It might be worth using typedefs for key and value type. Actually, since key type is fixed for the foreseeable future, maybe the radix tree template should define a key typedef?
> >
> > The term "result" is probably fine within the tidstore, but as a public name used by vacuum, it's not very descriptive. I don't have a good idea, though.
> >
> > Some files in backend/access use CamelCase for public functions, although it's not consistent. I think doing that for tidstore would help readability, since they would stand out from rt_* functions and vacuum functions. It's a matter of taste, though.
> >
> > I don't understand the control flow in tidstore_iterate_next(), or when BlockNumberIsValid() is true. If this is the best way to code this, it needs more commentary.
>
> The attached 0008 patch addressed all above comments on tidstore.
>
> > Some comments on vacuum:
> >
> > I think we'd better get some real-world testing of this, fairly soon.
> >
> > I had an idea: If it's not too much effort, it might be worth splitting it into two parts: one that just adds the store (not caring about its memory limits or progress reporting etc). During index scan, check both the new store and the array and log a warning (we don't want to exit or crash, better to try to investigate while live if possible) if the result doesn't match. Then perhaps set up an instance and let something like TPC-C run for a few days. The second patch would just restore the rest of the current patch. That would help reassure us it's working as designed.
>
> Yeah, I did a similar thing in an earlier version of tidstore patch.
> Since we're trying to introduce two new components: radix tree and
> tidstore, I sometimes find it hard to investigate failures happening
> during lazy (parallel) vacuum due to a bug either in tidstore or radix
> tree. If there is a bug in lazy vacuum, we cannot even do initdb. So
> it might be a good idea to do such checks in USE_ASSERT_CHECKING (or
> with another macro say DEBUG_TIDSTORE) builds. For example, TidStore
> stores tids to both the radix tree and array, and checks if the
> results match when lookup or iteration. It will use more memory but it
> would not be a big problem in USE_ASSERT_CHECKING builds. It would
> also be great if we can enable such checks on some bf animals.

I've tried this idea. Enabling this check on all debug builds (i.e.,
with USE_ASSERT_CHECKING macro) seems not a good idea so I use a
special macro for that, TIDSTORE_DEBUG. I think we can define this
macro on some bf animals (or possibly a new bf animal).

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v29-0011-Debug-TIDStore.patch.txt text/plain 9.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2023-02-22 06:19:38 Re: "out of relcache_callback_list slots" after multiple calls to pg_logical_slot_get_binary_changes
Previous Message Michael Paquier 2023-02-22 05:32:50 Re: psql memory leaks