Re: [PoC] Improve dead tuple storage for lazy vacuum

From: John Naylor <johncnaylorls(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2024-03-18 23:35:39
Message-ID: CANWCAZY014Nvdi4YZciJ7h0vXXH=6VkN66W-oiTqc0OziXzMEQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Mar 18, 2024 at 11:12 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Sun, Mar 17, 2024 at 11:46 AM John Naylor <johncnaylorls(at)gmail(dot)com> wrote:

> > Random offsets is what I was thinking of (if made distinct and
> > ordered), but even there the code is fairy trivial, so I don't have a
> > strong feeling about it.
>
> Agreed.

Looks good.

A related thing I should mention is that the tests which look up all
possible offsets are really expensive with the number of blocks we're
using now (assert build):

v70 0.33s
v72 1.15s
v73 1.32

To trim that back, I think we should give up on using shared memory
for the is-full test: We can cause aset to malloc a new block with a
lot fewer entries. In the attached, this brings it back down to 0.43s.
It might also be worth reducing the number of blocks in the random
test -- multiple runs will have different offsets anyway.

> > I think we can stop including the debug-tid-store patch for CI now.
> > That would allow getting rid of some unnecessary variables.
>
> Agreed.

Okay, all that remains here is to get rid of those variables (might be
just one).

> > + * Scan the TidStore and return a pointer to TidStoreIterResult that has TIDs
> > + * in one block. We return the block numbers in ascending order and the offset
> > + * numbers in each result is also sorted in ascending order.
> > + */
> > +TidStoreIterResult *
> > +TidStoreIterateNext(TidStoreIter *iter)
> >
> > The wording is a bit awkward.
>
> Fixed.

- * Scan the TidStore and return a pointer to TidStoreIterResult that has TIDs
- * in one block. We return the block numbers in ascending order and the offset
- * numbers in each result is also sorted in ascending order.
+ * Scan the TidStore and return the TIDs of the next block. The returned block
+ * numbers is sorted in ascending order, and the offset numbers in each result
+ * is also sorted in ascending order.

Better, but it's still not very clear. Maybe "The offsets in each
iteration result are ordered, as are the block numbers over all
iterations."

> > +/* Extract TIDs from the given key-value pair */
> > +static void
> > +tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key,
> > BlocktableEntry *page)
> >
> > This is a leftover from the old encoding scheme. This should really
> > take a "BlockNumber blockno" not a "key", and the only call site
> > should probably cast the uint64 to BlockNumber.
>
> Fixed.

This part looks good. I didn't notice earlier, but this comment has a
similar issue

@@ -384,14 +391,15 @@ TidStoreIterateNext(TidStoreIter *iter)
return NULL;

/* Collect TIDs extracted from the key-value pair */
- tidstore_iter_extract_tids(iter, key, page);
+ tidstore_iter_extract_tids(iter, (BlockNumber) key, page);

..."extracted" was once a separate operation. I think just removing
that one word is enough to update it.

Some other review on code comments:

v73-0001:

+ /* Enlarge the TID array if necessary */

It's "arrays" now.

v73-0005:

+-- Random TIDs test. We insert TIDs for 1000 blocks. Each block has
+-- different randon 100 offset numbers each other.

The numbers are obvious from the query. Maybe just mention that the
offsets are randomized and must be unique and ordered.

+ * The caller is responsible for release any locks.

"releasing"

> > +typedef struct BlocktableEntry
> > +{
> > + uint16 nwords;
> > + bitmapword words[FLEXIBLE_ARRAY_MEMBER];
> > +} BlocktableEntry;
> >
> > In my WIP for runtime-embeddable offsets, nwords needs to be one byte.

I should be more clear here: nwords fitting into one byte allows 3
embedded offsets (1 on 32-bit platforms, which is good for testing at
least). With uint16 nwords that reduces to 2 (none on 32-bit
platforms). Further, after the current patch series is fully
committed, I plan to split the embedded-offset patch into two parts:
The first would store the offsets in the header, but would still need
a (smaller) allocation. The second would embed them in the child
pointer. Only the second patch will care about the size of nwords
because it needs to reserve a byte for the pointer tag.

> > That doesn't have any real-world affect on the largest offset
> > encountered, and only in 32-bit builds with 32kB block size would the
> > theoretical max change at all. To be precise, we could use in the
> > MaxBlocktableEntrySize calculation:
> >
> > Min(MaxOffsetNumber, BITS_PER_BITMAPWORD * PG_INT8_MAX - 1);
>
> I don't get this expression. Making the nwords one byte works well?
> With 8kB blocks, MaxOffsetNumber is 2048 and it requires 256
> bitmapword entries on 64-bit OS or 512 bitmapword entries on 32-bit
> OS, respectively. One byte nwrods variable seems not to be sufficient

I believe there is confusion between bitmap words and bytes:
2048 / 64 = 32 words = 256 bytes

It used to be max tuples per (heap) page, but we wanted a simple way
to make this independent of heap. I believe we won't need to ever
store the actual MaxOffsetNumber, although we technically still could
with a one-byte type and 32kB pages, at least on 64-bit platforms.

> for both cases. Also, where does the expression "BITS_PER_BITMAPWORD *
> PG_INT8_MAX - 1" come from?

127 words, each with 64 (or 32) bits. The zero bit is not a valid
offset, so subtract one. And I used signed type in case there was a
need for -1 to mean something.

Attachment Content-Type Size
use-fewer-blocks.patch.nocfbot application/octet-stream 3.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Greg Sabino Mullane 2024-03-18 23:38:07 Re: Possibility to disable `ALTER SYSTEM`
Previous Message Robert Haas 2024-03-18 23:34:23 Re: documentation structure