Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2023-01-31 14:42:32
Message-ID: CAD21AoCSkEDYAqW3+JtSQQY+R2tDCrXA=PgNAUXUdUQ5gBOp0Q@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jan 30, 2023 at 11:30 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Mon, Jan 30, 2023 at 1:31 PM John Naylor
> <john(dot)naylor(at)enterprisedb(dot)com> wrote:
> >
> > On Sun, Jan 29, 2023 at 9:50 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > >
> > > On Sat, Jan 28, 2023 at 8:33 PM John Naylor
> > > <john(dot)naylor(at)enterprisedb(dot)com> wrote:
> >
> > > > The first implementation should be simple, easy to test/verify, easy to understand, and easy to replace. As much as possible anyway.
> > >
> > > Yes, but if a concurrent writer waits for another process to finish
> > > the iteration, it ends up waiting on a lwlock, which is not
> > > interruptible.
> > >
> > > >
> > > > > So the idea is that we set iter_active to true (with the
> > > > > lock in exclusive mode), and prevent concurrent updates when the flag
> > > > > is true.
> > > >
> > > > ...by throwing elog(ERROR)? I'm not so sure users of this API would prefer that to waiting.
> > >
> > > Right. I think if we want to wait rather than raise an ERROR, the
> > > waiter should wait in an interruptible way, for example, on a
> > > condition variable. I took a simpler approach in the v22 patch.
> > >
> > > ...but looking at dshash.c, dshash_seq_next() seems to return an entry
> > > while holding a lwlock on the partition. My assumption might be wrong.
> >
> > Using partitions there makes holding a lock less painful on average, I imagine, but I don't know the details there.
> >
> > If we make it clear that the first committed version is not (yet) designed for high concurrency with mixed read-write workloads, I think waiting (as a protocol) is fine. If waiting is a problem for some use case, at that point we should just go all the way and replace the locking entirely. In fact, it might be good to spell this out in the top-level comment and include a link to the second ART paper.
>
> Agreed. Will update the comments.
>
> >
> > > > [thinks some more...] Is there an API-level assumption that hasn't been spelled out? Would it help to have a parameter for whether the iteration function wants to reserve the privilege to perform writes? It could take the appropriate lock at the start, and there could then be multiple read-only iterators, but only one read/write iterator. Note, I'm just guessing here, and I don't want to make things more difficult for future improvements.
> > >
> > > Seems a good idea. Given the use case for parallel heap vacuum, it
> > > would be a good idea to support having multiple read-only iterators. The
> > > iteration in v22 is read-only, so if we want to support a read-write
> > > iterator, we would need to support a function that modifies the
> > > current key-value pair returned by the iteration.
> >
> > Okay, so updating during iteration is not currently supported. It could in the future, but I'd say that can also wait for fine-grained concurrency support. Intermediate-term, we should at least make it straightforward to support:
> >
> > 1) parallel heap vacuum -> multiple read-only iterators
> > 2) parallel heap pruning -> multiple writers
> >
> > It may or may not be worth it for someone to actually start either of those projects, and there are other ways to improve vacuum that may be more pressing. That said, it seems the tid store with global locking would certainly work fine for #1 and maybe "not too bad" for #2. #2 can also mitigate waiting by using larger batching, or the leader process could "pre-warm" the tid store with zero-values using block numbers from the visibility map.
>
> True. Using a larger batch size seems worth testing when we
> implement parallel heap pruning.
>
> In the next version patch, I'm going to update the locking support
> part and incorporate other comments I got.
>

I've attached the v24 patches. The locking support patch is now separate
(0005). I've also kept the updates to TidStore and the vacuum integration
from v23 as separate patches.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v24-0005-Add-read-write-lock-to-radix-tree-in-RT_SHMEM-ca.patch application/octet-stream 11.0 KB
v24-0008-Update-TidStore-patch-from-v23.patch application/octet-stream 10.8 KB
v24-0009-Update-vacuum-integration-patch-from-v23.patch application/octet-stream 10.0 KB
v24-0007-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch application/octet-stream 40.1 KB
v24-0006-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch application/octet-stream 34.4 KB
v24-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch application/octet-stream 6.1 KB
v24-0001-introduce-vector8_min-and-vector8_highbit_mask.patch application/octet-stream 2.7 KB
v24-0003-Add-radixtree-template.patch application/octet-stream 113.1 KB
v24-0004-Tool-for-measuring-radix-tree-performance.patch application/octet-stream 22.0 KB
