Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <johncnaylorls(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2024-03-19 03:23:47
Message-ID: CAD21AoAMoz7gLiYK6ZQ+pTW-sB1TbX9h_h=yxrTurWZ_3hFgnQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Mar 19, 2024 at 8:35 AM John Naylor <johncnaylorls(at)gmail(dot)com> wrote:
>
> On Mon, Mar 18, 2024 at 11:12 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Sun, Mar 17, 2024 at 11:46 AM John Naylor <johncnaylorls(at)gmail(dot)com> wrote:
>
> > > Random offsets is what I was thinking of (if made distinct and
> > > ordered), but even there the code is fairly trivial, so I don't have a
> > > strong feeling about it.
> >
> > Agreed.
>
> Looks good.
>
> A related thing I should mention is that the tests which look up all
> possible offsets are really expensive with the number of blocks we're
> using now (assert build):
>
> v70 0.33s
> v72 1.15s
> v73 1.32
>
> To trim that back, I think we should give up on using shared memory
> for the is-full test: We can cause aset to malloc a new block with a
> lot fewer entries. In the attached, this brings it back down to 0.43s.

Looks good. Agreed with this change.

> It might also be worth reducing the number of blocks in the random
> test -- multiple runs will have different offsets anyway.

Yes. If we reduce the number of blocks from 1000 to 100, the
regression test takes the following times in my environment:

1000 blocks : 516 ms
100 blocks : 228 ms

>
> > > I think we can stop including the debug-tid-store patch for CI now.
> > > That would allow getting rid of some unnecessary variables.
> >
> > Agreed.
>
> Okay, all that remains here is to get rid of those variables (might be
> just one).

Removed some unnecessary variables in the 0002 patch.

>
> > > + * Scan the TidStore and return a pointer to TidStoreIterResult that has TIDs
> > > + * in one block. We return the block numbers in ascending order and the offset
> > > + * numbers in each result is also sorted in ascending order.
> > > + */
> > > +TidStoreIterResult *
> > > +TidStoreIterateNext(TidStoreIter *iter)
> > >
> > > The wording is a bit awkward.
> >
> > Fixed.
>
> - * Scan the TidStore and return a pointer to TidStoreIterResult that has TIDs
> - * in one block. We return the block numbers in ascending order and the offset
> - * numbers in each result is also sorted in ascending order.
> + * Scan the TidStore and return the TIDs of the next block. The returned block
> + * numbers is sorted in ascending order, and the offset numbers in each result
> + * is also sorted in ascending order.
>
> Better, but it's still not very clear. Maybe "The offsets in each
> iteration result are ordered, as are the block numbers over all
> iterations."

Thanks, fixed.
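
For reference, here is a rough usage sketch of the iteration API as I
read it from the current patches; the result fields (blkno, num_offsets,
offsets) and the header path are my assumptions and may differ from the
final version:

#include "postgres.h"
#include "access/tidstore.h"

/*
 * Hedged sketch: walk a TidStore with the iteration API from the patch
 * series.  Blocks are returned in ascending order, and the offsets within
 * each result are sorted in ascending order.
 */
static void
dump_tidstore(TidStore *ts)
{
	TidStoreIter *iter = TidStoreBeginIterate(ts);
	TidStoreIterResult *result;

	while ((result = TidStoreIterateNext(iter)) != NULL)
	{
		for (int i = 0; i < result->num_offsets; i++)
			elog(DEBUG1, "block %u offset %u",
				 result->blkno, result->offsets[i]);
	}

	TidStoreEndIterate(iter);
}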

>
> > > +/* Extract TIDs from the given key-value pair */
> > > +static void
> > > +tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key,
> > > BlocktableEntry *page)
> > >
> > > This is a leftover from the old encoding scheme. This should really
> > > take a "BlockNumber blockno" not a "key", and the only call site
> > > should probably cast the uint64 to BlockNumber.
> >
> > Fixed.
>
> This part looks good. I didn't notice earlier, but this comment has a
> similar issue
>
> @@ -384,14 +391,15 @@ TidStoreIterateNext(TidStoreIter *iter)
> return NULL;
>
> /* Collect TIDs extracted from the key-value pair */
> - tidstore_iter_extract_tids(iter, key, page);
> + tidstore_iter_extract_tids(iter, (BlockNumber) key, page);
>
> ..."extracted" was once a separate operation. I think just removing
> that one word is enough to update it.

Fixed.

>
> Some other review on code comments:
>
> v73-0001:
>
> + /* Enlarge the TID array if necessary */
>
> It's "arrays" now.
>
> v73-0005:
>
> +-- Random TIDs test. We insert TIDs for 1000 blocks. Each block has
> +-- different randon 100 offset numbers each other.
>
> The numbers are obvious from the query. Maybe just mention that the
> offsets are randomized and must be unique and ordered.
>
> + * The caller is responsible for release any locks.
>
> "releasing"

Fixed.

>
> > > +typedef struct BlocktableEntry
> > > +{
> > > + uint16 nwords;
> > > + bitmapword words[FLEXIBLE_ARRAY_MEMBER];
> > > +} BlocktableEntry;
> > >
> > > In my WIP for runtime-embeddable offsets, nwords needs to be one byte.
>
> I should be more clear here: nwords fitting into one byte allows 3
> embedded offsets (1 on 32-bit platforms, which is good for testing at
> least). With uint16 nwords that reduces to 2 (none on 32-bit
> platforms). Further, after the current patch series is fully
> committed, I plan to split the embedded-offset patch into two parts:
> The first would store the offsets in the header, but would still need
> a (smaller) allocation. The second would embed them in the child
> pointer. Only the second patch will care about the size of nwords
> because it needs to reserve a byte for the pointer tag.

Thank you for the clarification.
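
To make sure I follow, here is a rough sketch of the header layout being
described. The field names and the free-space calculation are my guesses
for illustration, not the actual WIP patch:

#include "postgres.h"
#include "nodes/bitmapset.h"	/* bitmapword */
#include "storage/off.h"		/* OffsetNumber */

/*
 * Hedged sketch: with a one-byte word count, the remaining bytes of a
 * pointer-sized header can embed a few offsets directly, so small entries
 * need no bitmap allocation at all.  Names are illustrative only.
 */
#define NUM_EMBEDDED_OFFSETS \
	((sizeof(uintptr_t) - sizeof(uint8) - sizeof(int8)) / sizeof(OffsetNumber))
	/* evaluates to 3 on 64-bit platforms and 1 on 32-bit platforms */

typedef struct BlocktableEntry
{
	struct
	{
		uint8		flags;		/* reserves a byte for the pointer tag */
		int8		nwords;		/* number of bitmapwords that follow */
		OffsetNumber embedded_offsets[NUM_EMBEDDED_OFFSETS];
	}			header;
	bitmapword	words[FLEXIBLE_ARRAY_MEMBER];
} BlocktableEntry;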

>
> > > That doesn't have any real-world effect on the largest offset
> > > encountered, and only in 32-bit builds with 32kB block size would the
> > > theoretical max change at all. To be precise, we could use in the
> > > MaxBlocktableEntrySize calculation:
> > >
> > > Min(MaxOffsetNumber, BITS_PER_BITMAPWORD * PG_INT8_MAX - 1);
> >
> > I don't get this expression. Making the nwords one byte works well?
> > With 8kB blocks, MaxOffsetNumber is 2048 and it requires 256
> > bitmapword entries on 64-bit OS or 512 bitmapword entries on 32-bit
> > OS, respectively. One byte nwords variable seems not to be sufficient
>
> I believe there is confusion between bitmap words and bytes:
> 2048 / 64 = 32 words = 256 bytes

Oops, you're right.
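
Spelling out the arithmetic as a standalone check (assuming the default
8kB block size and a 64-bit bitmapword):

#include <assert.h>
#include <stdio.h>

int
main(void)
{
	int		max_offset = 2048;	/* MaxOffsetNumber with 8kB blocks (assumed) */
	int		bits_per_word = 64;	/* BITS_PER_BITMAPWORD on a 64-bit build */
	int		nwords = (max_offset + bits_per_word - 1) / bits_per_word;

	/* 2048 offsets need 2048 / 64 = 32 bitmapwords = 256 bytes, not 256 words */
	assert(nwords == 32);
	assert(nwords * 8 == 256);	/* 8 bytes per 64-bit bitmapword */
	printf("%d words, %d bytes\n", nwords, nwords * 8);
	return 0;
}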

>
> It used to be max tuples per (heap) page, but we wanted a simple way
> to make this independent of heap. I believe we won't need to ever
> store the actual MaxOffsetNumber, although we technically still could
> with a one-byte type and 32kB pages, at least on 64-bit platforms.
>
> > for both cases. Also, where does the expression "BITS_PER_BITMAPWORD *
> > PG_INT8_MAX - 1" come from?
>
> 127 words, each with 64 (or 32) bits. The zero bit is not a valid
> offset, so subtract one. And I used signed type in case there was a
> need for -1 to mean something.

Okay, I missed that we want to change nwords from uint8 to int8.

So the MaxBlocktableEntrySize calculation would be as follows?

#define MaxBlocktableEntrySize \
	offsetof(BlocktableEntry, words) + \
	(sizeof(bitmapword) * \
	 WORDS_PER_PAGE(Min(MaxOffsetNumber, \
						BITS_PER_BITMAPWORD * PG_INT8_MAX - 1)))

I've made this change in the 0003 patch.
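
For what it's worth, a small standalone check of the new bound (assuming a
64-bit build and the default 8kB block size, so MaxOffsetNumber is 2048):

#include <assert.h>

#define BITS_PER_BITMAPWORD	64		/* 64-bit build (assumed) */
#define PG_INT8_MAX			127
#define MAX_OFFSET_NUMBER	2048	/* 8kB blocks (assumed) */
#define Min(a, b)			((a) < (b) ? (a) : (b))

int
main(void)
{
	int		cap = BITS_PER_BITMAPWORD * PG_INT8_MAX - 1;

	/* the cap is 8127, well above MaxOffsetNumber for the default block size */
	assert(cap == 8127);
	assert(Min(MAX_OFFSET_NUMBER, cap) == MAX_OFFSET_NUMBER);
	return 0;
}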

While reviewing the vacuum patch, I realized that we always pass
LWTRANCHE_SHARED_TIDSTORE to RT_CREATE(), so the wait event associated
with the tidstore is always the same. I think it would be better to
have the caller of TidStoreCreate() specify the tranche_id and pass it
to RT_CREATE(). That way, each caller can use its own wait event for
its tidstore. The 0008 patch tries this idea; dshash.c takes the same
approach.
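
To illustrate the interface change I have in mind (a rough sketch only;
the names and the exact signature in the 0008 patch may differ):

#include "postgres.h"
#include "storage/lwlock.h"
#include "utils/dsa.h"

typedef struct TidStore TidStore;	/* opaque, as in tidstore.h */

/* proposed signature: the caller supplies the LWLock tranche ID */
extern TidStore *TidStoreCreate(size_t max_bytes, dsa_area *area,
								int tranche_id);

static TidStore *
create_dead_items_store(size_t vac_work_mem, dsa_area *dead_items_dsa)
{
	/* the caller picks its own tranche; the tranche name is illustrative */
	int			tranche_id = LWLockNewTrancheId();

	LWLockRegisterTranche(tranche_id, "vacuum dead items");
	return TidStoreCreate(vac_work_mem, dead_items_dsa, tranche_id);
}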

The other patches are minor updates to the tidstore and vacuum patches.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v74-ART.tar.gz application/gzip 28.0 KB
