Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2022-06-28 06:24:11
Message-ID: CAD21AoAuEPyK2Jf9ScZNd-vGzrqcyx5S_g-jGL__3CsykNJV_Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Mon, Jun 27, 2022 at 8:12 PM John Naylor
<john(dot)naylor(at)enterprisedb(dot)com> wrote:
>
> On Mon, Jun 20, 2022 at 7:57 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> [v3 patch]
>
> Hi Masahiko,
>
> Since there are new files, and they are pretty large, I've attached
> most specific review comments and questions as a diff rather than in
> the email body. This is not a full review, which will take more time
> -- this is a first pass mostly to aid my understanding, and discuss
> some of the design and performance implications.
>
> I tend to think it's a good idea to avoid most cosmetic review until
> it's close to commit, but I did mention a couple things that might
> enhance readability during review.

Thank you for reviewing the patch!

>
> As I mentioned to you off-list, I have some thoughts on the nodes using SIMD:
>
> > On Thu, Jun 16, 2022 at 4:30 PM John Naylor
> > <john(dot)naylor(at)enterprisedb(dot)com> wrote:
> > >
> > > For now, though, I'd like to question
> > > why we even need to use 32-byte registers in the first place. For one,
> > > the paper referenced has 16-pointer nodes, but none for 32 (next level
> > > is 48 and uses a different method to find the index of the next
> > > pointer). Andres' prototype has 32-pointer nodes, but in a quick read
> > > of his patch a couple weeks ago I don't recall a reason mentioned for
> > > it.
> >
> > I might be wrong but since AVX2 instruction set is introduced in
> > Haswell microarchitecture in 2013 and the referenced paper is
> > published in the same year, the art didn't use AVX2 instruction set.
>
> Sure, but with a bit of work the same technique could be done on that
> node size with two 16-byte registers.
>
> > 32-pointer nodes are better from a memory perspective as you
> > mentioned. Andres' prototype supports both 16-pointer nodes and
> > 32-pointer nodes (out of 6 node types). This would provide better
> > memory usage but on the other hand, it would also bring overhead of
> > switching the node type.
>
> Right, using more node types provides smaller increments of node size.
> Just changing node type can be better or worse, depending on the
> input.
>
> > Anyway, it's an important design decision to
> > support which size of node to support. It should be done based on
> > experiment results and documented.
>
> Agreed. I would add that in the first step, we want something
> straightforward to read and easy to integrate into our codebase.

Agreed.

> I
> suspect other optimizations would be worth a lot more than using AVX2:
> - collapsing inner nodes
> - taking care when constructing the key (more on this when we
> integrate with VACUUM)
> ...and a couple Andres mentioned:
> - memory management: in
> https://www.postgresql.org/message-id/flat/20210717194333.mr5io3zup3kxahfm%40alap3.anarazel.de
> - node dispatch:
> https://www.postgresql.org/message-id/20210728184139.qhvx6nbwdcvo63m6%40alap3.anarazel.de
>
> Therefore, I would suggest that we use SSE2 only, because:
> - portability is very easy
> - to avoid a performance hit from indirecting through a function pointer

Okay, I'll try these optimizations and see if the performance becomes better.

>
> When the PG16 cycle opens, I will work separately on ensuring the
> portability of using SSE2, so you can focus on other aspects.

Thanks!

> I think it would be a good idea to have both node16 and node32 for testing.
> During benchmarking we can delete one or the other and play with the
> other thresholds a bit.

I've done benchmark tests while changing the node types. The code base
is v3 patch that doesn't have the optimization you mentioned below
(memory management and node dispatch) but I added the code to use SSE2
for node-16 and node-32. The 'name' in the below result indicates the
kind of instruction set (AVX2 or SSE2) and the node type used. For
instance, sse2_4_32_48_256 means the radix tree has four kinds of node
types for each which have 4, 32, 48, and 256 pointers, respectively,
and use SSE2 instruction set.

* Case1 - Dense (simulating the case where there are 1000 consecutive
pages each of which has 100 dead tuples, at 100 page intervals.)
select prepare(
1000000, -- max block
100, -- # of dead tuples per page
1, -- dead tuples interval within a page
1000, -- # of consecutive pages having dead tuples
1100 -- page interval
);

name size attach
lookup
avx2_4_32_128_256 1154 MB 6742.53 ms 47765.63 ms
avx2_4_32_48_256 1839 MB 4239.35 ms 40528.39 ms
sse2_4_16_128_256 1154 MB 6994.43 ms 40383.85 ms
sse2_4_16_32_128_256 1154 MB 7239.35 ms 43542.39 ms
sse2_4_16_48_256 1839 MB 4404.63 ms 36048.96 ms
sse2_4_32_128_256 1154 MB 6688.50 ms 44902.64 ms

* Case2 - Sparse (simulating a case where there are pages that have 2
dead tuples every 1000 pages.)
select prepare(
10000000, -- max block
2, -- # of dead tuples per page
50, -- dead tuples interval within a page
1, -- # of consecutive pages having dead tuples
1000 -- page interval
);

name size attach lookup
avx2_4_32_128_256 1535 kB 1.85 ms 17427.42 ms
avx2_4_32_48_256 1472 kB 2.01 ms 22176.75 ms
sse2_4_16_128_256 1582 kB 2.16 ms 15391.12 ms
sse2_4_16_32_128_256 1535 kB 2.14 ms 18757.86 ms
sse2_4_16_48_256 1489 kB 1.91 ms 19210.39 ms
sse2_4_32_128_256 1535 kB 2.05 ms 17777.55 ms

The statistics of the number of each node types are:

* avx2_4_32_128_256 (dense and sparse)
* nkeys = 90910000, height = 3, n4 = 0, n32 = 285, n128 = 916629, n256 = 31
* nkeys = 20000, height = 3, n4 = 20000, n32 = 48, n128 = 208, n256 = 1

* avx2_4_32_48_256
* nkeys = 90910000, height = 3, n4 = 0, n32 = 285, n48 = 227, n256 = 916433
* nkeys = 20000, height = 3, n4 = 20000, n32 = 48, n48 = 159, n256 = 50

* sse2_4_16_128_256
* nkeys = 90910000, height = 3, n4 = 0, n16 = 0, n128 = 916914, n256 = 31
* nkeys = 20000, height = 3, n4 = 20000, n16 = 0, n128 = 256, n256 = 1

* sse2_4_16_32_128_256
* nkeys = 90910000, height = 3, n4 = 0, n16 = 0, n32 = 285, n128 =
916629, n256 = 31
* nkeys = 20000, height = 3, n4 = 20000, n16 = 0, n32 = 48, n128 =
208, n256 = 1

* sse2_4_16_48_256
* nkeys = 90910000, height = 3, n4 = 0, n16 = 0, n48 = 512, n256 = 916433
* nkeys = 20000, height = 3, n4 = 20000, n16 = 0, n48 = 207, n256 = 50

* sse2_4_32_128_256
* nkeys = 90910000, height = 3, n4 = 0, n32 = 285, n128 = 916629, n256 = 31
* nkeys = 20000, height = 3, n4 = 20000, n32 = 48, n128 = 208, n256 = 1

Observations are:

In both test cases, There is not much difference between using AVX2
and SSE2. The more mode types, the more time it takes for loading the
data (see sse2_4_16_32_128_256).

In dense case, since most nodes have around 100 children, the radix
tree that has node-128 had a good figure in terms of memory usage. On
the other hand, the radix tree that doesn't have node-128 has a better
number in terms of insertion performance. This is probably because we
need to iterate over 'isset' flags from the beginning of the array in
order to find an empty slot when inserting new data. We do the same
thing also for node-48 but it was better than node-128 as it's up to
48.

In terms of lookup performance, the results vary but I could not find
any common pattern that makes the performance better or worse. Getting
more statistics such as the number of each node type per tree level
might help me.

> Ideally, node16 and node32 would have the same code with a different
> loop count (1 or 2). More generally, there is too much duplication of
> code (noted by Andres in his PoC), and there are many variable names
> with the node size embedded. This is a bit tricky to make more
> general, so we don't need to try it yet, but ideally we would have
> something similar to:
>
> switch (node->kind) // todo: inspect tagged pointer
> {
> case RADIX_TREE_NODE_KIND_4:
> idx = node_search_eq(node, chunk, 4);
> do_action(node, idx, 4, ...);
> break;
> case RADIX_TREE_NODE_KIND_32:
> idx = node_search_eq(node, chunk, 32);
> do_action(node, idx, 32, ...);
> ...
> }
>
> static pg_alwaysinline void
> node_search_eq(radix_tree_node node, uint8 chunk, int16 node_fanout)
> {
> if (node_fanout <= SIMPLE_LOOP_THRESHOLD)
> // do simple loop with (node_simple *) node;
> else if (node_fanout <= VECTORIZED_LOOP_THRESHOLD)
> // do vectorized loop where available with (node_vec *) node;
> ...
> }
>
> ...and let the compiler do loop unrolling and branch removal. Not sure
> how difficult this is to do, but something to think about.

Agreed.

I'll update my patch based on your review comments and use SSE2.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2022-06-28 06:27:21 Re: margay fails assertion in stats/dsa/dsm code
Previous Message vignesh C 2022-06-28 06:17:53 Re: Handle infinite recursion in logical replication setup