Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2023-08-15 02:33:38
Message-ID: CAD21AoB8RY5aM_+PzWPEb5M=MZ9J4aVUH+xCj_D8uUx4JWcZDA@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Mon, Aug 14, 2023 at 8:05 PM John Naylor
<john(dot)naylor(at)enterprisedb(dot)com> wrote:
>
> On Thu, Jul 13, 2023 at 3:09 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > 0007, 0008, 0010, and 0011 are straightforward, and I agree to merge them.

Thank you for updating the patch!

>
> [Part 1 - clear the deck of earlier performance work etc]
>
> Thanks for taking a look! I've merged 0007 and 0008. The others need a performance test to justify them -- an eyeball check is not enough. I've now made the time to do that.
>
> ==== sparse loads
>
> v38 0001-0006 (still using node3 for this test only):
>
> select avg(load_ms) from generate_series(1,100) x(x), lateral (select * from bench_load_random_int(100 * 1000 * (1+x-x))) a;
> avg
> ---------------------
> 27.1000000000000000
>
> select avg(load_ms) from generate_series(1,30) x(x), lateral (select * from bench_load_random_int(500 * 1000 * (1+x-x))) a;
> avg
> ----------------------
> 165.6333333333333333
>
> v38-0007-Optimize-RT_EXTEND_DOWN.patch
>
> select avg(load_ms) from generate_series(1,100) x(x), lateral (select * from bench_load_random_int(100 * 1000 * (1+x-x))) a;
> avg
> ---------------------
> 25.0900000000000000
>
> select avg(load_ms) from generate_series(1,30) x(x), lateral (select * from bench_load_random_int(500 * 1000 * (1+x-x))) a;
> avg
> ----------------------
> 157.3666666666666667
>
> That seems worth doing.
>
> v38-0008-Use-4-children-for-node-4-also-attempt-portable-.patch
>
> This combines two things because I messed up a rebase: Use fanout of 4, and try some macros for shmem sizes, both 32- and 64-bit. Looking at this much, I no longer have a goal to have a separate set of size-classes for non-SIMD platforms, because that would cause global maintenance problems -- it's probably better to reduce worst-case search time where necessary. That would be much more localized.
>
> > I have some questions on 0009 patch:
>
> > According to the comment, this optimization is for only gcc?
>
> No, not at all. That tells me the comment is misleading.
>
> > I think this change reduces
> > readability and maintainability.
>
> Well, that much is obvious. What is not obvious is how much it gains us over the alternatives. I do have a simpler idea, though...
>
> ==== load mostly node4
>
> select * from bench_search_random_nodes(250*1000, '0xFFFFFF');
> n4 = 42626, n16 = 21492, n32 = 0, n64 = 0, n256 = 257
> mem_allocated | load_ms | search_ms
> ---------------+---------+-----------
> 7352384 | 25 | 0
>
> v38-0009-TEMP-take-out-search-time-from-bench.patch
>
> This is just to allow LATERAL queries for better measurements.
>
> select avg(load_ms) from generate_series(1,100) x(x), lateral (select * from bench_search_random_nodes(250*1000 * (1+x-x), '0xFFFFFF')) a;
>
> avg
> ---------------------
> 24.8333333333333333

0007, 0008, and 0009 look good to me.

>
> v38-0010-Try-a-simpler-way-to-avoid-memmove.patch
>
> This slightly rewrites the standard loop so that gcc doesn't turn it into a memmove(). Unlike the patch you didn't like, this *is* gcc-specific. (needs a comment, which I forgot)
>
> avg
> ---------------------
> 21.9600000000000000
>
> So, that's not a trivial difference. I wasn't a big fan of Andres' __asm("") workaround, but that may be just my ignorance about it. We need something like either of the two.
>
> v38-0011-Optimize-add_child_4-take-2.patch
> avg
> ---------------------
> 21.3500000000000000
>
> This is possibly faster than v38-0010, but it doesn't look worth the complexity, assuming the other way avoids the bug going forward.

I prefer 0010, but is it worth testing with other compilers such as clang?
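
To make sure we're talking about the same thing, here is a rough sketch of the pattern and the asm-style workaround (placeholder names; this is not the code from either patch, and I don't know which shape 0010 ends up with exactly):

    /*
     * Sketch only -- placeholder names, not the actual patch code.  The
     * first loop is the "natural" way to shift node4 elements right to
     * open a slot; gcc's loop idiom recognition can replace it with a
     * call to memmove(), which is a loss for so few elements.  The
     * second variant shows the __asm("") style of workaround: the empty
     * asm acts as an optimization barrier in the loop body, so the
     * idiom is not recognized.
     */
    #include <stdint.h>

    /* natural form: gcc may turn this loop into memmove() */
    static void
    shift_chunks_naive(uint8_t *chunks, int count, int insertpos)
    {
        for (int i = count; i > insertpos; i--)
            chunks[i] = chunks[i - 1];
    }

    /* empty asm per iteration defeats the idiom recognition (gcc-specific) */
    static void
    shift_chunks_asm_barrier(uint8_t *chunks, int count, int insertpos)
    {
        for (int i = count; i > insertpos; i--)
        {
            chunks[i] = chunks[i - 1];
            __asm__("");
        }
    }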

>
> > According to the bugzilla ticket
> > referred to in the comment, it's recognized as a bug in the community,
> > so once the gcc bug is fixed, we might no longer need this trick, no?
>
> No comment in two years...
>
> v38-0013-Use-constant-for-initial-copy-of-chunks-and-chil.patch
>
> This is the same as v37-0011. I wasn't quite satisfied with it since it still has two memcpy() calls, but it actually seems to regress:
>
> avg
> ---------------------
> 22.0900000000000000
>
> v38-0012-Use-branch-free-coding-to-skip-new-element-index.patch
>
> This patch uses a single loop for the copy.
>
> avg
> ---------------------
> 21.0300000000000000
>
> Within noise level of v38-0011, but it's small and simple, so I like it, at least for small arrays.

Agreed.
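
For my own notes, the single-loop idea as I understand it looks roughly like this (placeholder names; the actual 0012 code may differ):

    #include <stdint.h>

    /*
     * Sketch, not the patch code: copy all existing elements in one
     * pass, bumping the destination index by one once we reach the slot
     * being inserted.  The comparison (i >= insertpos) evaluates to 0
     * or 1, so the loop body has no branch.
     */
    static void
    copy_chunks_skipping_slot(uint8_t *dst, const uint8_t *src,
                              int count, int insertpos)
    {
        for (int i = 0; i < count; i++)
            dst[i + (i >= insertpos)] = src[i];
    }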

>
> v38-0014-node48-Remove-need-for-RIGHTMOST_ONE-in-radix-tr.patch
> v38-0015-node48-Remove-dead-code-by-using-loop-local-var.patch
>
> Just small cleanups.
>
> v38-0016-Use-memcpy-for-children-when-growing-into-node48.patch
>
> Makes sense, but untested.

Agreed.

BTW, cfbot reported that some regression tests failed due to OOM. I've
attached a patch to fix it.

>
> ===============
> [Part 2]
>
> Per off-list discussion with Masahiko, it makes sense to take some of the ideas I've used locally on tidbitmap, and start incorporating them into earlier vacuum work to get that out the door faster. With that in mind...
>
> v38-0017-Make-tidstore-more-similar-to-tidbitmap.patch
>
> This uses a simplified PagetableEntry (unimaginatively called BlocktableEntry just to avoid confusion), to be replaced with the real thing at a later date. This is still fixed size, to be replaced with a varlen type.

That's more readable.
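
Just to check my understanding of the direction, I imagine the fixed-size value looks conceptually something like the sketch below (field names and sizes are my guesses, not the actual patch code):

    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_OFFSETS      291    /* guess: roughly MaxHeapTuplesPerPage */
    #define BITS_PER_WORD     64
    #define WORDS_PER_ENTRY  ((MAX_OFFSETS + BITS_PER_WORD - 1) / BITS_PER_WORD)

    /* guessed shape: one fixed-size bitmap of offsets per block-number key */
    typedef struct BlocktableEntry
    {
        uint64_t    words[WORDS_PER_ENTRY];
    } BlocktableEntry;

    static inline void
    entry_set_offset(BlocktableEntry *entry, int off)
    {
        entry->words[off / BITS_PER_WORD] |= UINT64_C(1) << (off % BITS_PER_WORD);
    }

    static inline bool
    entry_test_offset(const BlocktableEntry *entry, int off)
    {
        return (entry->words[off / BITS_PER_WORD] >> (off % BITS_PER_WORD)) & 1;
    }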

>
> Looking at the tidstore tests again after some months, I'm not particularly pleased with the amount of code required for how little it seems to be testing, nor with the output when something fails. (I wonder how hard it would be to have SQL functions that add blocks/offsets to the tid store, and emit tuples of tids found in the store.)

It would not be hard to have such SQL functions. I'll try it.

>
> I'm also concerned about the number of places that have to know if the store is using shared memory or not. Something to think about later.
>
> v38-0018-Consolidate-inserting-updating-values.patch
>
> This is something I coded up to get to an API more similar to the one in simplehash, as used in tidbitmap.c. It seems worth doing on its own to reduce code duplication, and it also simplifies coding of varlen types and "runtime-embeddable values".

Agreed.
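
The simplehash-style shape (one entry point that returns a pointer to the value slot and reports whether it already existed) should also make the varlen work easier to reason about. A toy illustration of that API contract, with a trivial array standing in for the radix tree (not the patch's code):

    #include <stdbool.h>
    #include <stddef.h>

    #define TOY_KEYS 16

    /* trivial stand-in for the radix tree, just to show the API contract */
    typedef struct ToyStore
    {
        bool    present[TOY_KEYS];
        int     values[TOY_KEYS];
    } ToyStore;

    /*
     * Return a pointer to the value slot for "key", creating it if it
     * does not exist yet; "*found" reports whether it already existed.
     * Callers initialize or update through the returned pointer, so
     * there is no separate insert vs. update path.
     */
    static int *
    toy_get(ToyStore *store, size_t key, bool *found)
    {
        *found = store->present[key];
        if (!*found)
        {
            store->present[key] = true;
            store->values[key] = 0;
        }
        return &store->values[key];
    }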

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
fix_oom.patch application/octet-stream 661 bytes
