Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2023-01-29 14:49:56
Message-ID: CAD21AoBjmj0fDyTJCYvs3W2GQb9bUT5pwQLOU_=BUTOWhj9ToQ@mail.gmail.com
Lists: pgsql-hackers

On Sat, Jan 28, 2023 at 8:33 PM John Naylor
<john(dot)naylor(at)enterprisedb(dot)com> wrote:
>
> On Thu, Jan 26, 2023 at 3:33 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Thu, Jan 26, 2023 at 3:54 PM John Naylor
> > <john(dot)naylor(at)enterprisedb(dot)com> wrote:
>
> > I think that we need to prevent concurrent updates (RT_SET() and
> > RT_DELETE()) during the iteration to get the consistent result through
> > the whole iteration operation. Unlike other operations such as
> > RT_SET(), we cannot expect that a job doing something for each
> > key-value pair in the radix tree completes in a short time, so we
> > cannot keep holding the radix tree lock until the end of the
> > iteration.
>
> This sounds like a performance concern, rather than a correctness concern, is that right? If so, I don't think we should worry too much about optimizing simple locking, because it will *never* be fast enough for highly-concurrent read-write workloads anyway, and anyone interested in those workloads will have to completely replace the locking scheme, possibly using one of the ideas in the last ART paper you mentioned.
>
> The first implementation should be simple, easy to test/verify, easy to understand, and easy to replace. As much as possible anyway.

Yes, but if a concurrent writer waits for another process to finish
the iteration, it ends up waiting on an LWLock, which is not
interruptible.

>
> > So the idea is that we set iter_active to true (with the
> > lock in exclusive mode), and prevent concurrent updates when the flag
> > is true.
>
> ...by throwing elog(ERROR)? I'm not so sure users of this API would prefer that to waiting.

Right. I think that if we want writers to wait rather than raise an
ERROR, they should wait in an interruptible way, for example on a
condition variable. I took a simpler approach in the v22 patch.

...but looking at dshash.c, dshash_seq_next() seems to return an entry
while holding an LWLock on the partition, so my assumption might be
wrong.

>
> > > Since there were calls to LWLockAcquire/Release in the last version, I'm a bit confused by this. Perhaps for the next patch, the email should contain a few sentences describing how locking is intended to work, including for iteration.
> >
> > The lock I'm thinking of adding is a simple readers-writer lock. This
> > lock is used for concurrent radix tree operations except for the
> > iteration. For operations concurrent to the iteration, I used a flag
> > for the reason I mentioned above.
>
> This doesn't tell me anything -- we already agreed on "simple reader-writer lock", months ago I believe. And I only have a vague idea about the tradeoffs made regarding iteration.
>
> + * WIP: describe about how locking works.
>
> A first draft of what is intended for this WIP would be a good start. This WIP is from v23-0016, which contains no comments and a one-line commit message. I'd rather not try closely studying that patch (or how it works with 0011) until I have a clearer understanding of what requirements are assumed, what trade-offs are considered, and how it should be tested.
>
> [thinks some more...] Is there an API-level assumption that hasn't been spelled out? Would it help to have a parameter for whether the iteration function wants to reserve the privilege to perform writes? It could take the appropriate lock at the start, and there could then be multiple read-only iterators, but only one read/write iterator. Note, I'm just guessing here, and I don't want to make things more difficult for future improvements.

That seems like a good idea. Given the use case of parallel heap
vacuum, it would be good to support multiple read-only iterators. The
iteration in v22 is read-only, so if we want to support a read-write
iterator, we would also need a function that modifies the current
key-value pair returned by the iteration.

>
> > > Hmm, I wonder if we need to use the isolation tester. It's both a blessing and a curse that the first client of this data structure is tid lookup. It's a blessing because it doesn't present a highly-concurrent workload mixing reads and writes and so simple locking is adequate. It's a curse because to test locking and have any chance of finding bugs, we can't rely on vacuum to tell us that because (as you've said) it might very well work fine with no locking at all. So we must come up with test cases ourselves.
> >
> > Using the isolation tester to test locking seems like a good idea. We
> > can include it in test_radixtree. But given that the locking in the
> > radix tree is very simple, the test case would be very simple. It may
> > be controversial whether it's worth adding such testing by adding both
> > the new test module and test cases.
>
> I mean that the isolation tester (or something else) would contain test cases. I didn't mean to imply redundant testing.

Okay, understood.

>
> > I think the user (e.g, vacuumlazy.c) can pass the maximum offset
> > number to the parallel vacuum.
>
> Okay, sounds good.
>
> Most of v23's cleanups/fixes in the radix template look good to me, although I didn't read the debugging code very closely. There is one exception:
>
> 0006 - I've never heard of memset'ing a variable to avoid "variable unused" compiler warnings, and it seems strange. It turns out we don't actually need this variable in the first place. The attached .txt patch removes the local variable and just writes to the passed pointer. This required callers to initialize a couple of their own variables, but only child pointers, at least on gcc 12.

I agree with the attached patch.

> And I will work later on making "value" in the public API a pointer.

Thanks!

>
> 0017 - I haven't taken a close look at the new changes, but I did notice this some time ago:
>
> + if (TidStoreIsShared(ts))
> + return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
> + else
> + return sizeof(TidStore) + sizeof(TidStore) +
> + local_rt_memory_usage(ts->tree.local);
>
> There is repetition in the else branch.

Agreed, will remove.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
