Re: [PoC] Improve dead tuple storage for lazy vacuum

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2023-01-16 05:01:57
Message-ID: CAFBsxsEPQkCr50co+EA-+bvQNaVv-Cr99JxT-CMGBBqs0bGftQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jan 12, 2023 at 9:51 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
wrote:
>
> On Thu, Jan 12, 2023 at 5:21 PM John Naylor
> <john(dot)naylor(at)enterprisedb(dot)com> wrote:
> >
> > Okay, I'll squash the previous patch and work on cleaning up the
> > internals. I'll keep the external APIs the same so that your work on
> > vacuum integration can be easily rebased on top of that, and we can
> > work independently.

There were some conflicts with HEAD, so to keep the CF bot busy, I've
quickly put together v18. I still have a lot of cleanup work to do, but
this is enough for now.

0003 contains all v17 local-memory coding squashed together.

0004 is the perf test; it's not updated yet, but it doesn't build by default,
so it's fine for now.

0005 removes node.chunk as discussed, but does not change node4 fanout yet.

0006 is a small cleanup regarding setting node fanout.

0007 squashes my shared memory work with Masahiko's fixes from the addendum
v17-0010.

0008 turns the existence checks in RT_NODE_UPDATE_INNER into Asserts, as
discussed.

0009/0010 are just copies of Masahiko's v17 addendum v17-0011/12, but the
latter rebased over recent variable renaming (it's possible I missed
something, so worth checking).

> I've implemented the idea of using union. Let me share WIP code for
> discussion, I've attached three patches that can be applied on top of

Seems fine as far as the union goes. Let's go ahead with this, and make
progress on locking etc.

> Overall, TidStore implementation with the union idea doesn't look so
> ugly to me. But I got many compiler warning about unused radix tree
> functions like:
>
> tidstore.c:99:19: warning: 'shared_rt_delete' defined but not used
> [-Wunused-function]
>
> I'm not sure there is a convenient way to suppress this warning but
> one idea is to have some macros to specify what operations are
> enabled/declared.

That sounds like a good idea. It's also worth wondering if we even need
RT_NUM_ENTRIES at all, since the caller is capable of keeping track of that
if necessary. It's also misnamed, since it's concerned with the number of
keys. The vacuum case cares about the number of TIDs, not the number of
(encoded) keys, and even if we ever (say) changed the key to block number
and the value to a Bitmapset, the number of keys might not be interesting.
It sounds like we should at least make the delete functionality optional.
(Side note on optional functions: if an implementation didn't care about
iteration or its order, we could optimize insertion into linear nodes.)
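
To make that concrete, here's a rough sketch of the kind of guard I'm
imagining; RT_USE_DELETE is a name I'm making up on the spot and the
signature is from memory, so treat it as illustrative only:

/* interface section of radixtree.h */
#ifdef RT_USE_DELETE
RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
#endif

/*
 * ...and the same guard around the definition in the implementation
 * section. A caller that doesn't "#define RT_USE_DELETE" before including
 * radixtree.h then never gets local_rt_delete / shared_rt_delete emitted,
 * and the -Wunused-function warning goes away.
 */
#ifdef RT_USE_DELETE
RT_SCOPE bool
RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
{
    /* existing delete implementation goes here, unchanged */
}
#endif

The same pattern could cover RT_NUM_ENTRIES and iteration if we decide
those should be optional as well.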

Since this is WIP, you may already have some polish in mind, so I won't go
over the patches in detail, but I wanted to ask about a few things (numbers
referring to v17 addendum, not v18):

0011

+ * 'num_tids' is the number of Tids stored so far. 'max_byte' is the maximum
+ * bytes a TidStore can use. These two fields are commonly used in both
+ * non-shared case and shared case.
+ */
+ uint32 num_tids;

uint32 is how we store the block number, so this is too small and will wrap
around on overflow: with a few hundred dead items per page, a table somewhere
above 100GB is enough to exceed 2^32 TIDs. int64 seems better.

+ * We calculate the maximum bytes for the TidStore in different ways
+ * for non-shared case and shared case. Please refer to the comment
+ * TIDSTORE_MEMORY_DEDUCT for details.
+ */

Maybe the #define and its comment should be close to here.

+ * Destroy a TidStore, returning all memory. The caller must be certain that
+ * no other backend will attempt to access the TidStore before calling this
+ * function. Other backend must explicitly call tidstore_detach to free up
+ * backend-local memory associated with the TidStore. The backend that calls
+ * tidstore_destroy must not call tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)

If this isn't addressed by the next patch, the comment needs a FIXME or TODO
about making certain of that.

+ * Add Tids on a block to TidStore. The caller must ensure the offset numbers
+ * in 'offsets' are ordered in ascending order.

Must? What happens otherwise?
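
If it is a hard requirement, maybe it's worth a cheap sanity check in
assert-enabled builds, something like this (parameter names guessed from the
comment, so adjust to taste):

#ifdef USE_ASSERT_CHECKING
    /* verify the caller really did pass strictly ascending offsets */
    for (int i = 1; i < num_offsets; i++)
        Assert(offsets[i] > offsets[i - 1]);
#endif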

+ uint64 last_key = PG_UINT64_MAX;

I'm having some difficulty understanding this sentinel and how it's used.

@@ -1039,11 +1040,18 @@ lazy_scan_heap(LVRelState *vacrel)
if (prunestate.has_lpdead_items)
{
Size freespace;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;

- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ result = tidstore_iterate_next(iter);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, &vmbuffer);
+ Assert(!tidstore_iterate_next(iter));
+ tidstore_end_iterate(iter);

/* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ tidstore_reset(dead_items);

This part only runs "if (vacrel->nindexes == 0)", so this seems like unneeded
complexity. It arises because lazy_scan_prune() populates the tid store
even if no index vacuuming happens. Perhaps the caller of lazy_scan_prune()
could pass the deadoffsets array, and upon returning, either populate the
store or call lazy_vacuum_heap_page(), as needed. It's quite possible I'm
missing some detail, so some description of the design choices made would
be helpful.
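
To be more concrete about what I mean (hand-wavy pseudo-C, with function and
variable names approximated rather than taken from the actual patches):

    lazy_scan_prune(vacrel, buf, blkno, page, &prunestate,
                    deadoffsets, &ndeadoffsets);

    if (ndeadoffsets > 0)
    {
        if (vacrel->nindexes == 0)
        {
            /* no indexes, so remove the LP_DEAD items right away */
            lazy_vacuum_heap_page(vacrel, blkno, deadoffsets, ndeadoffsets,
                                  buf, &vmbuffer);
        }
        else
        {
            /* remember them for the index/heap vacuuming passes */
            tidstore_add_tids(vacrel->dead_items, blkno,
                              deadoffsets, ndeadoffsets);
        }
    }

That way the no-index path wouldn't need to touch the tid store at all.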

On Mon, Jan 16, 2023 at 9:53 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
wrote:

> I've written a simple script to simulate the DSA memory usage and the
> limit. The 75% limit works fine for a power of two cases, and we can
> use the 60% limit for other cases (it seems we can use up to about 66%
> but used 60% for safety). It would be best if we can mathematically
> prove it but I could prove only the power of two cases. But the script
> practically shows the 60% threshold would work for these cases.

Okay. It's worth highlighting this in the comments, and also the fact that
it depends on internal details of how DSA increases segment size.

--
John Naylor
EDB: http://www.enterprisedb.com

Attachment Content-Type Size
v18-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch application/x-patch 6.1 KB
v18-0001-introduce-vector8_min-and-vector8_highbit_mask.patch application/x-patch 2.6 KB
v18-0005-Remove-chunk-from-the-common-node-type.patch application/x-patch 2.8 KB
v18-0004-tool-for-measuring-radix-tree-performance.patch application/x-patch 20.0 KB
v18-0003-Add-radixtree-template.patch application/x-patch 97.2 KB
v18-0006-Clarify-coding-around-fanout.patch application/x-patch 2.3 KB
v18-0009-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch application/x-patch 23.9 KB
v18-0008-Turn-branch-into-Assert-in-RT_NODE_UPDATE_INNER.patch application/x-patch 3.9 KB
v18-0007-Implement-shared-memory.patch application/x-patch 40.3 KB
v18-0010-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch application/x-patch 34.5 KB
