Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <johncnaylorls(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2024-03-15 14:16:57
Message-ID: CAD21AoAy7f=H4vbqtz=G1sUHWQhVOKr8Eqcn3fqeTR6AfLrXng@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 15, 2024 at 4:36 PM John Naylor <johncnaylorls(at)gmail(dot)com> wrote:
>
> On Thu, Mar 14, 2024 at 7:04 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Thu, Mar 14, 2024 at 6:55 PM John Naylor <johncnaylorls(at)gmail(dot)com> wrote:
> > >
> > > On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > >
> > > > On Thu, Mar 14, 2024 at 1:29 PM John Naylor <johncnaylorls(at)gmail(dot)com> wrote:
> > > > > Okay, here's an another idea: Change test_lookup_tids() to be more
> > > > > general and put the validation down into C as well. First we save the
> > > > > blocks from do_set_block_offsets() into a table, then with all those
> > > > > blocks lookup a sufficiently-large range of possible offsets and save
> > > > > found values in another array. So the static items structure would
> > > > > have 3 arrays: inserts, successful lookups, and iteration (currently
> > > > > the iteration output is private to check_set_block_offsets(). Then
> > > > > sort as needed and check they are all the same.
> > > >
> > > > That's a promising idea. We can use the same mechanism for randomized
> > > > tests too. If you're going to work on this, I'll do other tests on my
> > > > environment in the meantime.
> > >
> > > Some progress on this in v72 -- I tried first without using SQL to
> > > save the blocks, just using the unique blocks from the verification
> > > array. It seems to work fine.
> >
> > Thanks!
>
> Seems I forgot the attachment last time...there's more stuff now
> anyway, based on discussion.

Thank you for updating the patches!

The idea of using three TID arrays for the lookup test and iteration
test looks good to me. I think we can add random-TIDs tests on top of
it.
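
Just to confirm my understanding, IIUC the verification state ends up
roughly like this (a sketch only with hypothetical names; the actual
struct in test_tidstore.c may differ):

    /* sketch only -- ItemPointerData is from storage/itemptr.h */
    typedef struct ItemArray
    {
        ItemPointerData *insert_tids;   /* TIDs given to do_set_block_offsets() */
        ItemPointerData *lookup_tids;   /* TIDs found by the lookup checks */
        ItemPointerData *iter_tids;     /* TIDs returned by iteration */
        int         num_inserts;
        int         num_lookups;
        int         num_iters;
    } ItemArray;

with all three arrays sorted and compared at check time.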

>
> > > - Since there are now three arrays we should reduce max bytes to
> > > something smaller.
> >
> > Agreed.
>
> I went further than this, see below.
>
> > > - Further on that, I'm not sure if the "is full" test is telling us
> > > much. It seems we could make max bytes a static variable and set it to
> > > the size of the empty store. I'm guessing it wouldn't take much to add
> > > enough tids so that the contexts need to allocate some blocks, and
> > > then it would appear full and we can test that. I've made it so all
> > > arrays repalloc when needed, just in case.
> >
> > How about using work_mem as max_bytes instead of having it as a static
> > variable? In test_tidstore.sql we set work_mem before creating the
> > tidstore. It would make the tidstore more controllable by SQL queries.
>
> My complaint is that the "is full" test is trivial, and also strange
> in that max_bytes is used for two unrelated things:
>
> - the initial size of the verification arrays, which was always larger
> than necessary, and now there are three of them
> - the hint to TidStoreCreate to calculate its max block size / the
> threshold for being "full"
>
> To make the "is_full" test slightly less trivial, my idea is to save
> the empty store size and later add enough tids so that it has to
> allocate new blocks/DSA segments, which is not that many, and then it
> will appear full. I've done this and also separated the purpose of
> various sizes in v72-0009/10.

I see your point and the changes look good to me.
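
IIUC the check now boils down to comparing the current memory usage
with the size saved at creation time, something like the following
(a hypothetical sketch, not quoting the patch; I'm assuming the
memory accounting function is TidStoreMemoryUsage() here):

    /* "full" once the store has had to allocate beyond its empty size */
    is_full = (TidStoreMemoryUsage(tidstore) > tidstore_empty_size);

where tidstore_empty_size would be saved right after creating the
empty store.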

> Using actual work_mem seems a bit more difficult to make this work.

Agreed.

>
>
> > ---
> > + if (TidStoreIsShared(ts))
> > + found = shared_rt_set(ts->tree.shared, blkno, page);
> > + else
> > + found = local_rt_set(ts->tree.local, blkno, page);
> > +
> > + Assert(!found);
> >
> > Given TidStoreSetBlockOffsets() is designed to always set (i.e.
> > overwrite) the value, I think we should not expect that found is
> > always false.
>
> I find that a puzzling statement, since 1) it was designed for
> insert-only workloads, not actual overwrite IIRC and 2) the tests will
> now fail if the same block is set twice, since we just switched the
> tests to use a remnant of vacuum's old array. Having said that, I
> don't object to removing artificial barriers to using it for purposes
> not yet imagined, as long as test_tidstore.sql warns against that.

I think that if it supports only insert-only workloads and expects
each block to be set only once, it should raise an error rather than
an assertion failure. It seems odd to me that the function fails only
in assertion-enabled builds even though it actually works fine in
that case.
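
To illustrate, I'm thinking of something like this instead of the
Assert (a sketch only; the exact message and whether to use elog or
ereport are open):

    if (TidStoreIsShared(ts))
        found = shared_rt_set(ts->tree.shared, blkno, page);
    else
        found = local_rt_set(ts->tree.local, blkno, page);

    /* overwriting an existing block is not supported; fail explicitly */
    if (found)
        elog(ERROR, "block %u is already set in the TidStore", blkno);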

As for test_tidstore, you're right that the test code doesn't handle
the case where the same block is set twice. I think there is no
problem in the fixed-TIDs tests, but we would need something for the
random-TIDs tests so that we don't set the same block twice. I guess
it could be trivial since we can use SQL queries to generate the
TIDs. I'm not sure yet what the random-TIDs tests will look like, but
I think we can use SELECT DISTINCT to eliminate duplicate block
numbers.
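
For example, something along these lines could generate the block
numbers (a rough sketch; the range and how the result is fed into
do_set_block_offsets() are left open):

    -- generate random block numbers, deduplicated so that no block
    -- is set more than once
    SELECT DISTINCT (random() * 10000)::int AS blkno
    FROM generate_series(1, 1000);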

>
> Given the above two things, I think this function's comment needs
> stronger language about its limitations. Perhaps even mention that
> it's intended for, and optimized for, vacuum. You and I have long
> known that tidstore would need a separate, more complex, function to
> add or remove individual tids from existing entries, but it might be
> good to have that documented.

Agreed.

>
> Other things:
>
> v72-0011: Test that zero offset raises an error.
>
> v72-0013: I had wanted to microbenchmark this, but since we are
> running short of time I decided to skip that, so I want to revert some
> code to make it again more similar to the equivalent in tidbitmap.c.
> In the absence of evidence, it seems better to do it this way.

LGTM.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
