Re: [PoC] Improve dead tuple storage for lazy vacuum

From: John Naylor <johncnaylorls(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2023-12-27 03:07:48
Message-ID: CANWCAZYSwqSaYVVqdo1oSM_qX+PtTYSWhiJ+SDg02VgAB5DR_A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Dec 26, 2023 at 12:43 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Thu, Dec 21, 2023 at 4:41 PM John Naylor <johncnaylorls(at)gmail(dot)com> wrote:

> > +TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
> > + int num_offsets)
> > +{
> > + char buf[MaxBlocktableEntrySize];
> > + BlocktableEntry *page = (BlocktableEntry *) buf;
> >
> > I'm not sure this is safe with alignment. Maybe rather than plain
> > "char", it needs to be a union with BlocktableEntry, or something.
>
> I tried it in the new patch set but could you explain why it could not
> be safe with alignment?

I was thinking because "buf" is just an array of bytes. But, since the
next declaration is a cast to a pointer to the actual type, maybe we
can rely on the compiler to do the right thing. (It seems to on my
machine in any case)

> > About separation of responsibilities for locking: The only thing
> > currently where the tid store is not locked is tree iteration. That's
> > a strange exception. Also, we've recently made RT_FIND return a
> > pointer, so the caller must somehow hold a share lock, but I think we
> > haven't exposed callers the ability to do that, and we rely on the tid
> > store lock for that. We have a mix of tree locking and tid store
> > locking. We will need to consider carefully how to make this more
> > clear, maintainable, and understandable.
>
> Yes, tidstore should be locked during the iteration.
>
> One simple direction about locking is that the radix tree has the lock
> but no APIs hold/release it. It's the caller's responsibility. If a
> data structure using a radix tree for its storage has its own lock
> (like tidstore), it can use it instead of the radix tree's one. A

It looks like the only reason tidstore has its own lock is because it
has no way to delegate locking to the tree's lock. Instead of working
around the limitations of the thing we've designed, let's make it work
for the one use case we have. I think we need to expose RT_LOCK_*
functions to the outside, and have tid store use them. That would
allow us to simplify all those "if (TidStoreIsShared(ts)
LWLockAcquire(..., ...)" calls, which are complex and often redundant.

At some point, we'll probably want to keep locking inside, at least to
smooth the way for fine-grained locking you mentioned.

> > In a green field, it'd be fine to replace these with an expression of
> > "num_offsets", but it adds a bit of noise for reviewers and the git
> > log. Is it really necessary?
>
> I see your point. I think we can live with having both
> has_lpdead_items and num_offsets. But we will have to check if these
> values are consistent, which could be less maintainable.

It would be clearer if that removal was split out into a separate patch.

> > I'm also not quite sure why "deadoffsets" and "lpdead_items" got
> > moved to the PruneState. The latter was renamed in a way that makes
> > more sense, but I don't see why the churn is necessary.
...
> > I guess it was added here, 800 lines away? If so, why?
>
> The above changes are related. The idea is not to use tidstore in a
> one-pass strategy. If the table doesn't have any indexes, in
> lazy_scan_prune() we collect offset numbers of dead tuples on the page
> and vacuum the page using them. In this case, we don't need to use
> tidstore so we pass the offsets array to lazy_vacuum_heap_page(). The
> LVPagePruneState is a convenient place to store collected offset
> numbers.

Okay, that makes sense, but if it was ever explained, I don't
remember, and there is nothing in the commit message either.

I'm not sure this can be split up easily, but if so it might help reviewing.

This change also leads to a weird-looking control flow:

if (vacrel->nindexes == 0)
{
if (prunestate.num_offsets > 0)
{
...
}
}
else if (prunestate.num_offsets > 0)
{
...
}

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2023-12-27 03:23:24 Re: POC: GROUP BY optimization
Previous Message Alexander Korotkov 2023-12-27 02:48:08 Re: POC: GROUP BY optimization