Re: [PoC] Improve dead tuple storage for lazy vacuum

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2023-02-07 09:25:44
Message-ID: CAFBsxsHbBm_M22gLBO+AZT4mfMq3L_oX3wdKZxjeNnT7fHsYMQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jan 31, 2023 at 9:43 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
wrote:

> I've attached v24 patches. The locking support patch is separated
> (0005 patch). Also I kept the updates for TidStore and the vacuum
> integration from v23 separate.

Okay, that's a lot simpler, and closer to what I imagined. For v25, I
squashed v24's additions and added a couple of my own. I've kept the CF
status at "needs review" because no specific action is required at the
moment.

I did start to review the TID store some more, but that's on hold because
something else came up: On a lark I decided to re-run some benchmarks to
see if anything got lost in converting to a template, and that led me down
a rabbit hole -- some good and bad news on that below.

0001:

I removed the uint64 case, as discussed. There is now a brief commit
message, but it needs to be fleshed out a bit. I took another look at the
Arm optimization that Nathan found some months ago, for forming the highbit
mask, but it doesn't play nicely with how node32 uses the mask, so I
decided against it. I added a comment to describe the reasoning in case
someone else gets a similar idea.

I briefly looked into "separate-commit TODO: move non-SIMD fallbacks to
their own header to clean up the #ifdef maze.", but decided it wasn't such
a clear win to justify starting the work now. It's still in the back of my
mind, but I removed the reminder from the commit message.
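
To illustrate what that maze looks like, here is a rough sketch of the kind
of movemask-style helper 0001 adds and its non-SIMD fallback. The name and
exact branches here are my assumptions for illustration, not the patch's
code:

#include "postgres.h"
#include "port/simd.h"

/*
 * Sketch only: return a bitmask with one bit set per byte of 'v' whose
 * high bit is set.  node32 uses such a mask to locate the position of a
 * matching chunk, not just to test whether a match exists.
 */
static inline uint32
sketch_highbit_mask(const Vector8 v)
{
#ifdef USE_SSE2
	return (uint32) _mm_movemask_epi8(v);
#else
	/*
	 * Fallback, assuming Vector8 is a plain uint64 in non-SIMD builds:
	 * collect the high bit of each byte.  (A real Neon path would need
	 * several instructions to emulate movemask and is omitted here.)
	 */
	uint32		mask = 0;

	for (int i = 0; i < (int) sizeof(v); i++)
		mask |= ((uint32) (((const uint8 *) &v)[i] >> 7)) << i;

	return mask;
#endif
}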

0003:

The template now requires the value to be passed as a pointer. That was a
pretty trivial change, but affected multiple other patches, so not sent
separately. Also adds a forgotten RT_ prefix to the bitmap macros and adds
a top comment to the *_impl.h headers. There are some comment fixes. The
changes were either trivial or discussed earlier, so also not sent
separately.

0004/5: I wanted to measure the load time as well as search time in
bench_search_random_nodes(). That's kept separate to make it easier to test
other patch versions.

The bad news is that the speed of loading TIDs in
bench_seq/shuffle_search() has regressed noticeably. I can't reproduce this
in any other bench function, and it was the reason for writing 0005 to
begin with. More confusingly, my efforts to fix this improved *other*
functions, but the former didn't budge at all. First, the patches:

0006 adds and removes some "inline" declarations (where it made sense), and
adds some "pg_noinline" declarations, based on Andres' advice from some
months ago.
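
For clarity, the pattern is along these lines (function names are made up;
pg_noinline is the existing macro from c.h):

#include "postgres.h"

/* Hot path: a trivial helper like this is worth asking to inline. */
static inline bool
chunk_exists(const uint8 *chunks, int count, uint8 chunk)
{
	for (int i = 0; i < count; i++)
	{
		if (chunks[i] == chunk)
			return true;
	}
	return false;
}

/*
 * Cold path: node growth is rare, so pg_noinline keeps this code out of
 * the caller's hot instruction stream.
 */
static pg_noinline void *
grow_children_array(void *children, Size old_size, Size new_size)
{
	void	   *newarr = palloc0(new_size);

	memcpy(newarr, children, old_size);
	pfree(children);
	return newarr;
}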

0007 removes some dead code. RT_NODE_INSERT_INNER is only called during
RT_SET_EXTEND, so it can't possibly find an existing key. This kind of
change is much easier to be confident about with the inner and node cases
handled together in a template, since it's clear how those cases differ. I
thought about performing the search in assert builds and verifying that the
key doesn't exist, but yet another #ifdef seemed too messy.
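
Schematically (with made-up names and layout, not the template's code), the
reasoning is:

#include "postgres.h"

typedef struct InnerNode3
{
	uint8		count;
	uint8		chunks[3];
	void	   *children[3];
} InnerNode3;

/*
 * A general-purpose insert would first search chunks[] for an existing
 * entry and replace its child in place.  But when the insert function is
 * reached only from the extend path, the caller guarantees the chunk is
 * not yet present, so that search can never succeed and is dead code.
 */
static void
inner_node_insert(InnerNode3 *node, uint8 chunk, void *child)
{
	/* no existence check: the extend path never inserts a duplicate */
	node->chunks[node->count] = chunk;
	node->children[node->count] = child;
	node->count++;
}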

v25-addendum-try-no-maintain-order.txt -- This makes keeping the key chunks
in order optional for the linear-search nodes. I believe the TID store no
longer cares about the ordering, but this is a text file for now because I
don't want to clutter the CI with a behavior change. Also, the second ART
paper (on concurrency) mentions that some locking schemes don't allow these
arrays to be shifted. So it might make sense to give up entirely on
guaranteeing ordered iteration, or at least make it optional, as in the
patch.
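
To make the trade-off concrete, here is a rough sketch (not the patch) of
what keeping the chunk array ordered costs on insert, compared with simply
appending:

#include "postgres.h"

/* Ordered variant: find the slot and shift later entries to the right. */
static void
node_insert_ordered(uint8 *chunks, void **children, int count,
					uint8 chunk, void *child)
{
	int			idx;

	for (idx = 0; idx < count; idx++)
	{
		if (chunks[idx] > chunk)
			break;
	}
	/* these shifts are what the addendum avoids */
	memmove(&chunks[idx + 1], &chunks[idx], (count - idx) * sizeof(uint8));
	memmove(&children[idx + 1], &children[idx], (count - idx) * sizeof(void *));
	chunks[idx] = chunk;
	children[idx] = child;
}

/* Unordered variant: append; search and iteration must not assume order. */
static void
node_insert_unordered(uint8 *chunks, void **children, int count,
					  uint8 chunk, void *child)
{
	chunks[count] = chunk;
	children[count] = child;
}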

Now for some numbers:

========================================
psql -c "select * from bench_search_random_nodes(10*1000*1000)"
(min load time of three)

v15:
mem_allocated | load_ms | search_ms
---------------+---------+-----------
334182184 | 3352 | 2073

v25-0005:
mem_allocated | load_ms | search_ms
---------------+---------+-----------
331987008 | 3426 | 2126

v25-0006 (inlining or not):
mem_allocated | load_ms | search_ms
---------------+---------+-----------
331987008 | 3327 | 2035

v25-0007 (remove dead code):
mem_allocated | load_ms | search_ms
---------------+---------+-----------
331987008 | 3313 | 2037

v25-addendum...txt (no ordering):
mem_allocated | load_ms | search_ms
---------------+---------+-----------
331987008 | 2762 | 2042

Allowing unordered inserts helps a lot here in loading. That's expected
because there are a lot of inserts into the linear nodes. 0006 might help a
little.

========================================
psql -c "select avg(load_ms) from generate_series(1,30) x(x), lateral
(select * from bench_load_random_int(500 * 1000 * (1+x-x))) a"

v15:
avg
----------------------
207.3000000000000000

v25-0005:
avg
----------------------
190.6000000000000000

v25-0006 (inlining or not):
avg
----------------------
189.3333333333333333

v25-0007 (remove dead code):
avg
----------------------
186.4666666666666667

v25-addendum...txt (no ordering):
avg
----------------------
179.7000000000000000

Most of the improvement from v15 to v25 probably comes from the change from
node4 to node3, and this test stresses that node the most. That shows in
the total memory used: it goes from 152MB to 132MB. Allowing unordered
inserts helps some; the others are not convincing.

========================================
psql -c "select rt_load_ms, rt_search_ms from bench_seq_search(0, 1 * 1000
* 1000)"
(min load time of three)

v15:
rt_load_ms | rt_search_ms
------------+--------------
113 | 455

v25-0005:
rt_load_ms | rt_search_ms
------------+--------------
135 | 456

v25-0006 (inlining or not):
rt_load_ms | rt_search_ms
------------+--------------
136 | 455

v25-0007 (remove dead code):
rt_load_ms | rt_search_ms
------------+--------------
135 | 455

v25-addendum...txt (no ordering):
rt_load_ms | rt_search_ms
------------+--------------
134 | 455

Note: The regression seems to have started in v17, which was the first
version with a full template.

Nothing so far has helped here, and previous experience has shown that
trying to profile something that only takes ~100ms will not be useful.
Instead of putting more effort into diving deeper, it seems a better use of
time to write a benchmark that exercises the TID store itself. That's more
realistic, since this function was intended to test loading and searching
of TIDs, but the TID store doesn't quite operate so simply anymore. What do
you think, Masahiko?

I'm inclined to keep 0006, because it might give a slight boost, and 0007
because it's never a bad idea to remove dead code.
--
John Naylor
EDB: http://www.enterprisedb.com

Attachment Content-Type Size
v25-addendum-try-no-maintain-order.txt text/plain 1.3 KB
v25-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch text/x-patch 6.1 KB
v25-0001-Introduce-helper-SIMD-functions-for-small-byte-a.patch text/x-patch 2.9 KB
v25-0005-Measure-load-time-of-bench_search_random_nodes.patch text/x-patch 2.4 KB
v25-0004-Tool-for-measuring-radix-tree-performance.patch text/x-patch 22.0 KB
v25-0003-Add-radixtree-template.patch text/x-patch 116.9 KB
v25-0006-Adjust-some-inlining-declarations.patch text/x-patch 2.1 KB
v25-0007-Skip-unnecessary-searches-in-RT_NODE_INSERT_INNE.patch text/x-patch 4.7 KB
v25-0008-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch text/x-patch 35.2 KB
v25-0009-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch text/x-patch 48.1 KB
