Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <johncnaylorls(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2024-01-22 07:23:43
Message-ID: CAD21AoBpMbEjqbJCvk5iQcAWeZFhr3afTO1Hz6WvCuv_GtrYTg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jan 22, 2024 at 2:36 PM John Naylor <johncnaylorls(at)gmail(dot)com> wrote:
>
> On Mon, Jan 22, 2024 at 10:28 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On further thought, as you pointed out before, "num_tids" should not
> > be in tidstore in terms of integration with tidbitmap.c, because
> > tidbitmap.c has "lossy pages". With lossy pages, "num_tids" is no
> > longer accurate and useful. Similarly, looking at tidbitmap.c, it has
> > npages and nchunks but they will not be necessary in lazy vacuum use
> > case. Also, assuming that we support parallel heap pruning, probably
> > we need to somehow lock the tidstore while adding tids to the tidstore
> > concurrently by parallel vacuum worker. But in tidbitmap use case, we
> > don't need to lock the tidstore since it doesn't have multiple
> > writers.
>
> Not currently, and it does seem bad to require locking where it's not required.
>
> (That would be a prerequisite for parallel index scan. It's been tried
> before with the hash table, but concurrency didn't scale well with the
> hash table. I have no reason to think that the radix tree would scale
> significantly better with the same global LW lock, but as you know
> there are other locking schemes possible.)
>
> > Given these facts, different statistics and different lock
> > strategies are required by different use case. So I think there are 3
> > options:
> >
> > 1. expose lock functions for tidstore and the caller manages the
> > statistics in the outside of tidstore. For example, in lazyvacuum.c we
> > would have a TidStore for tid storage as well as VacDeadItemsInfo that
> > has num_tids and max_bytes. Both are in LVRelState. For parallel
> > vacuum, we pass both to the workers via DSM and pass both to function
> > where the statistics are required. As for the exposed lock functions,
> > when adding tids to the tidstore, the caller would need to call
> > something like TidStoreLockExclusive(ts) that further calls
> > LWLockAcquire(ts->tree.shared->ctl.lock, LW_EXCLUSIVE) internally.
>
> The advantage here is that vacuum can avoid locking entirely while
> using shared memory, just like it does now, and has the option to add
> it later.

True.

> IIUC, the radix tree struct would have a lock member, but wouldn't
> take any locks internally? Maybe we still need one for
> RT_MEMORY_USAGE? For that, I see dsa_get_total_size() takes its own
> DSA_AREA_LOCK -- maybe that's enough?

I think that's a good point. So there will be no place where the radix
tree takes any locks internally.

>
> That seems simplest, and is not very far from what we do now. If we do
> this, then the lock functions should be where we branch for is_shared.

Agreed.

>
> > 2. add callback functions to tidstore so that the caller can do its
> > work while holding a lock on the tidstore. This is like the idea we
> > just discussed for radix tree. The caller passes a callback function
> > and user data to TidStoreSetBlockOffsets(), and the callback is called
> > after setting tids. Similar to option 1, the statistics need to be
> > stored in a different area.
>
> I think we'll have to move to something like this eventually, but it
> seems like overkill right now.

Right.

>
> > 3. keep tidstore.c and tidbitmap.c separate implementations but use
> > radix tree in tidbitmap.c. tidstore.c would have "num_tids" in its
> > control object and doesn't have any lossy page support. On the other
> > hand, in tidbitmap.c we replace simplehash with radix tree. This makes
> > tidstore.c simple but we would end up having different data structures
> > for similar usage.
>
> They have so much in common that it's worth it to use the same
> interface and (eventually) value type. They just need separate paths
> for adding tids, as we've discussed.

Agreed.

>
> > I think it's worth trying option 1. What do you think, John?
>
> +1

Thanks!

Before working on this idea, since the latest patches conflict with
the current HEAD, I share the latest patch set (v54). Here is the
summary:

- As for radix tree part, it's based on v53 patch. I've squashed most
of cleanups and changes in v53 except for "DRAFT: Stop using invalid
pointers as placeholders." as I thought you might want to still work
on it. BTW it includes "#undef RT_SHMEM".
- As for tidstore, it's based on v51. That is, it still has the
control object and num_tids there.
- As for vacuum integration, it's also based on v51. But we no longer
need to change has_lpdead_items and LVPagePruneState thanks to the
recent commit c120550edb8 and e313a61137.

For the next version patch, I'll work on this idea and try to clean up
locking stuff both in tidstore and radix tree. Or if you're already
working on some of them, please let me know. I'll review it.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v54-ART.tar.gz application/x-gzip 57.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2024-01-22 07:31:59 make dist using git archive
Previous Message Amul Sul 2024-01-22 07:21:25 Re: Add system identifier to backup manifest