Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <johncnaylorls(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2024-01-02 13:01:07
Message-ID: CAD21AoB1xPDVgmPho1Ngb-C8_a1exkyjUxkvFbkQGkxGFhvwgA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Dec 27, 2023 at 12:08 PM John Naylor <johncnaylorls(at)gmail(dot)com> wrote:
>
> On Tue, Dec 26, 2023 at 12:43 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Thu, Dec 21, 2023 at 4:41 PM John Naylor <johncnaylorls(at)gmail(dot)com> wrote:
>
> > > +TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
> > > + int num_offsets)
> > > +{
> > > + char buf[MaxBlocktableEntrySize];
> > > + BlocktableEntry *page = (BlocktableEntry *) buf;
> > >
> > > I'm not sure this is safe with alignment. Maybe rather than plain
> > > "char", it needs to be a union with BlocktableEntry, or something.
> >
> > I tried it in the new patch set but could you explain why it could not
> > be safe with alignment?
>
> I was thinking because "buf" is just an array of bytes. But, since the
> next declaration is a cast to a pointer to the actual type, maybe we
> can rely on the compiler to do the right thing. (It seems to on my
> machine in any case)

Okay, I kept it.

>
> > > About separation of responsibilities for locking: The only thing
> > > currently where the tid store is not locked is tree iteration. That's
> > > a strange exception. Also, we've recently made RT_FIND return a
> > > pointer, so the caller must somehow hold a share lock, but I think we
> > > haven't exposed callers the ability to do that, and we rely on the tid
> > > store lock for that. We have a mix of tree locking and tid store
> > > locking. We will need to consider carefully how to make this more
> > > clear, maintainable, and understandable.
> >
> > Yes, tidstore should be locked during the iteration.
> >
> > One simple direction about locking is that the radix tree has the lock
> > but no APIs hold/release it. It's the caller's responsibility. If a
> > data structure using a radix tree for its storage has its own lock
> > (like tidstore), it can use it instead of the radix tree's one. A
>
> It looks like the only reason tidstore has its own lock is because it
> has no way to delegate locking to the tree's lock. Instead of working
> around the limitations of the thing we've designed, let's make it work
> for the one use case we have. I think we need to expose RT_LOCK_*
> functions to the outside, and have tid store use them. That would
> allow us to simplify all those "if (TidStoreIsShared(ts)
> LWLockAcquire(..., ...)" calls, which are complex and often redundant.

I agree that we expose RT_LOCK_* functions and have tidstore use them,
but am not sure the if (TidStoreIsShared(ts) LWLockAcquire(..., ...)"
calls part. I think that even if we expose them, we will still need to
do something like "if (TidStoreIsShared(ts))
shared_rt_lock_share(ts->tree.shared)", no?

>
> At some point, we'll probably want to keep locking inside, at least to
> smooth the way for fine-grained locking you mentioned.
>
> > > In a green field, it'd be fine to replace these with an expression of
> > > "num_offsets", but it adds a bit of noise for reviewers and the git
> > > log. Is it really necessary?
> >
> > I see your point. I think we can live with having both
> > has_lpdead_items and num_offsets. But we will have to check if these
> > values are consistent, which could be less maintainable.
>
> It would be clearer if that removal was split out into a separate patch.

Agreed.

>
> > > I'm also not quite sure why "deadoffsets" and "lpdead_items" got
> > > moved to the PruneState. The latter was renamed in a way that makes
> > > more sense, but I don't see why the churn is necessary.
> ...
> > > I guess it was added here, 800 lines away? If so, why?
> >
> > The above changes are related. The idea is not to use tidstore in a
> > one-pass strategy. If the table doesn't have any indexes, in
> > lazy_scan_prune() we collect offset numbers of dead tuples on the page
> > and vacuum the page using them. In this case, we don't need to use
> > tidstore so we pass the offsets array to lazy_vacuum_heap_page(). The
> > LVPagePruneState is a convenient place to store collected offset
> > numbers.
>
> Okay, that makes sense, but if it was ever explained, I don't
> remember, and there is nothing in the commit message either.
>
> I'm not sure this can be split up easily, but if so it might help reviewing.

Agreed.

>
> This change also leads to a weird-looking control flow:
>
> if (vacrel->nindexes == 0)
> {
> if (prunestate.num_offsets > 0)
> {
> ...
> }
> }
> else if (prunestate.num_offsets > 0)
> {
> ...
> }

Fixed.

I've attached a new patch set. From v47 patch, I've merged your
changes for radix tree, and split the vacuum integration patch into 3
patches: simply replaces VacDeadItems with TidsTore (0007 patch), and
use a simple TID array for one-pass strategy (0008 patch), and replace
has_lpdead_items with "num_offsets > 0" (0009 patch), while
incorporating your review comments on the vacuum integration patch
(sorry for making it difficult to see the changes from v47 patch).
0013 to 0015 patches are also updates from v47 patch.

I'm thinking that we should change the order of the patches so that
tidstore patch requires the patch for changing DSA segment sizes. That
way, we can remove the complex max memory calculation part that we no
longer use from the tidstore patch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v48-ART.tar.gz application/x-gzip 67.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bertrand Drouvot 2024-01-02 14:07:58 Re: Track in pg_replication_slots the reason why slots conflict?
Previous Message Sehrope Sarkuni 2024-01-02 12:23:31 Re: Password leakage avoidance