Quick Links

Re: [PoC] Improve dead tuple storage for lazy vacuum

From:	John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To:	Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc:	Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [PoC] Improve dead tuple storage for lazy vacuum
Date:	2023-08-14 11:05:21
Message-ID:	CAFBsxsFDhOSg4kGRLEbqqpdUANGHRREHd5EGZ+mDARe5wB5Zog@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Thu, Jul 13, 2023 at 3:09 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
wrote:
>
> 0007, 0008, 0010, and 0011 are straightforward and agree to merge them.

[Part 1 - clear the deck of earlier performance work etc]

Thanks for taking a look! I've merged 0007 and 0008. The others need a
performance test to justify them -- an eyeball check is not enough. I've
now made the time to do that.

==== sparse loads

v38 0001-0006 (still using node3 for this test only):

select avg(load_ms) from generate_series(1,100) x(x), lateral (select *
from bench_load_random_int(100 * 1000 * (1+x-x))) a;
avg
---------------------
27.1000000000000000

select avg(load_ms) from generate_series(1,30) x(x), lateral (select * from
bench_load_random_int(500 * 1000 * (1+x-x))) a;
avg
----------------------
165.6333333333333333

v38-0007-Optimize-RT_EXTEND_DOWN.patch

select avg(load_ms) from generate_series(1,100) x(x), lateral (select *
from bench_load_random_int(100 * 1000 * (1+x-x))) a;
avg
---------------------
25.0900000000000000

select avg(load_ms) from generate_series(1,30) x(x), lateral (select * from
bench_load_random_int(500 * 1000 * (1+x-x))) a;
avg
----------------------
157.3666666666666667

That seems worth doing.

v38-0008-Use-4-children-for-node-4-also-attempt-portable-.patch

This combines two things because I messed up a rebase: Use fanout of 4, and
try some macros for shmem sizes, both 32- and 64-bit. Looking at this much,
I no longer have a goal to have a separate set of size-classes for non-SIMD
platforms, because that would cause global maintenance problems -- it's
probably better to reduce worst-case search time where necessary. That
would be much more localized.

> I have some questions on 0009 patch:

> According to the comment, this optimization is for only gcc?

No, not at all. That tells me the comment is misleading.

> I think this change reduces
> readability and maintainability.

Well, that much is obvious. What is not obvious is how much it gains us
over the alternatives. I do have a simpler idea, though...

==== load mostly node4

select * from bench_search_random_nodes(250*1000, '0xFFFFFF');
n4 = 42626, n16 = 21492, n32 = 0, n64 = 0, n256 = 257
mem_allocated | load_ms | search_ms
---------------+---------+-----------
7352384 | 25 | 0

v38-0009-TEMP-take-out-search-time-from-bench.patch

This is just to allow LATERAL queries for better measurements.

select avg(load_ms) from generate_series(1,100) x(x), lateral (select *
from bench_search_random_nodes(250*1000 * (1+x-x), '0xFFFFFF')) a;

avg
---------------------
24.8333333333333333

v38-0010-Try-a-simpler-way-to-avoid-memmove.patch

This slightly rewrites the standard loop so that gcc doesn't turn it into a
memmove(). Unlike the patch you didn't like, this *is* gcc-specific. (needs
a comment, which I forgot)

avg
---------------------
21.9600000000000000

So, that's not a trivial difference. I wasn't a big fan of Andres'
__asm("") workaround, but that may be just my ignorance about it. We need
something like either of the two.

v38-0011-Optimize-add_child_4-take-2.patch
avg
---------------------
21.3500000000000000

This is possibly faster than v38-0010, but looking like not worth the
complexity, assuming the other way avoids the bug going forward.

> According to the bugzilla ticket
> referred to in the comment, it's realized as a bug in the community,
> so once the gcc bug fixes, we might no longer need this trick, no?

No comment in two years...

v38-0013-Use-constant-for-initial-copy-of-chunks-and-chil.patch

This is the same as v37-0011. I wasn't quite satisfied with it since it
still has two memcpy() calls, but it actually seems to regress:

avg
---------------------
22.0900000000000000

v38-0012-Use-branch-free-coding-to-skip-new-element-index.patch

This patch uses a single loop for the copy.

avg
---------------------
21.0300000000000000

Within noise level of v38-0011, but it's small and simple, so I like it, at
least for small arrays.

v38-0014-node48-Remove-need-for-RIGHTMOST_ONE-in-radix-tr.patch
v38-0015-node48-Remove-dead-code-by-using-loop-local-var.patch

Just small cleanups.

v38-0016-Use-memcpy-for-children-when-growing-into-node48.patch

Makes sense, but untested.

===============
[Part 2]

Per off-list discussion with Masahiko, it makes sense to take some of the
ideas I've used locally on tidbitmap, and start incorporating them into
earlier vacuum work to get that out the door faster. With that in mind...

v38-0017-Make-tidstore-more-similar-to-tidbitmap.patch

This uses a simplified PagetableEntry (unimaginatively called
BlocktableEntry just to avoid confusion), to be replaced with the real
thing at a later date. This is still fixed size, to be replaced with a
varlen type.

Looking at the tidstore tests again after some months, I'm not particularly
pleased with the amount of code required for how little it seems to be
testing, nor the output when something fails. (I wonder how hard it would
be to have SQL functions that add blocks/offsets to the tid store, and emit
tuples of tids found in the store.)

I'm also concerned about the number of places that have to know if the
store is using shared memory or not. Something to think about later.

v38-0018-Consolidate-inserting-updating-values.patch

This is something I coded up to get to an API more similar to one in
simplehash, as used in tidbitmap.c. It seem worth doing on its own to
reduce code duplication, and also simplifies coding of varlen types and
"runtime-embeddable values".

--
John Naylor
EDB: http://www.enterprisedb.com

Attachment	Content-Type	Size
v38-ART.tar.gz	application/gzip	56.7 KB

In response to

Re: [PoC] Improve dead tuple storage for lazy vacuum at 2023-07-13 08:08:38 from Masahiko Sawada

Responses

Re: [PoC] Improve dead tuple storage for lazy vacuum at 2023-08-15 02:33:38 from Masahiko Sawada

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Dagfinn Ilmari Mannsåker	2023-08-14 11:11:13	Re: Ignore 2PC transaction GIDs in query jumbling
Previous Message	Tatsuo Ishii	2023-08-14 10:45:21	Re: pgbench with libevent?