Re: [PoC] Improve dead tuple storage for lazy vacuum

From: John Naylor <johncnaylorls(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2024-01-18 04:30:01
Message-ID: CANWCAZZ2p1u4rBmxeYx9dBRrS8=cn7_X086hc+uqNM82jdPrAw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jan 18, 2024 at 8:31 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> It seems we cannot make this work nicely. IIUC VacDeadItems is
> allocated in DSM and TidStore is embedded there. However,
> dead_items->items.area is a local pointer to dsa_area. So we cannot
> include dsa_area in neither TidStore nor RT_RADIX_TREE. Instead we
> would need to pass dsa_area to each interface by callers.

Thanks again for exploring this line of thinking! Okay, it seems even
if there's a way to make this work, it would be too invasive to
justify when compared with the advantage I was hoping for.

> > If we can't make this work nicely, I'd be okay with keeping the tid
> > store control object. My biggest concern is unnecessary
> > double-locking.
>
> If we don't do any locking stuff in radix tree APIs and it's the
> user's responsibility at all, probably we don't need a lock for
> tidstore? That is, we expose lock functions as you mentioned and the
> user (like tidstore) acquires/releases the lock before/after accessing
> the radix tree and num_items.

I'm not quite sure what the point of "num_items" is anymore, because
it was really tied to the array in VacDeadItems. dead_items->num_items
is essential to reading/writing the array correctly. If this number is
wrong, the array is corrupt. There is no such requirement for the
radix tree. We don't need to know the number of tids to add to it or
do a lookup, or anything.

There are a number of places where we assert "the running count of the
dead items" is the same as "the length of the dead items array", like
here:

@@ -2214,7 +2205,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;

Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == TidStoreNumTids(vacrel->dead_items));

As such, in HEAD I'm guessing it's arbitrary which one is used for
control flow. Correct me if I'm mistaken. If I am wrong for some part
of the code, it'd be good to understand when that invariant can't be
maintained.

@@ -1258,7 +1265,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (TidStoreNumTids(dead_items) > 0)
lazy_vacuum(vacrel);

Like here. In HEAD, could this have used vacrel->dead_items?

@@ -2479,14 +2473,14 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (TidStoreNumTids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));

ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " INT64_FORMAT "dead item identifiers
in %u pages",
+ vacrel->relname, TidStoreNumTids(vacrel->dead_items),
+ vacuumed_pages)));

We assert that vacrel->lpdead_items has the expected value, and then
the ereport repeats the function call (with a lock) to read the value
we just consulted to pass the assert.

If we *really* want to compare counts, maybe we could invent a
debugging-only function that iterates over the tree and popcounts the
bitmaps. That seems too expensive for regular assert builds, though.

On the subject of debugging builds, I think it no longer makes sense
to have the array for debug checking in tid store, even during
development. A few months ago, we had an encoding scheme that looked
simple on paper, but its code was fiendishly difficult to follow (at
least for me). That's gone. In addition to the debugging count above,
we could also put a copy of the key in the BlockTableEntry's header,
in debug builds. We don't yet need to care about the key size, since
we don't (yet) have runtime-embeddable values.

> Currently (as of v52 patch) RT_FIND is
> doing so,

[meaning, there is no internal "automatic" locking here since after we
switched to variable-length types, an outstanding TODO]
Maybe it's okay to expose global locking for v17. I have one possible
alternative:

This week I tried an idea to use a callback there so that after
internal unlocking, the caller received the value (or whatever else
needs to happen, such as lookup an offset in the tid bitmap). I've
attached a draft for that that passes radix tree tests. It's a bit
awkward, but I'm guessing this would more closely match future
internal atomic locking. Let me know what you think of the concept,
and then do whichever way you think is best. (using v53 as the basis)

I believe this is the only open question remaining. The rest is just
polish and testing.

> During trying this idea, I realized that there is a visibility problem
> in the radix tree template

If it's broken even without the embedding I'll look into this (I don't
know if this configuration has ever been tested). I think a good test
is putting the shared tid tree in it's own translation unit, to see if
anything needs to be fixed. I'll go try that.

Attachment Content-Type Size
rt-find-callback-interface.patch.nocfbot application/octet-stream 8.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro Horiguchi 2024-01-18 04:49:44 Re: initdb's -c option behaves wrong way?
Previous Message David Rowley 2024-01-18 04:08:33 Re: Strange Bitmapset manipulation in DiscreteKnapsack()