Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2022-11-24 14:54:06
Message-ID: CAD21AoCiiKz_sxNFAvKY_--7-O6ioOwr6Rf0aL71cQvccipiPQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Nov 21, 2022 at 6:30 PM John Naylor
<john(dot)naylor(at)enterprisedb(dot)com> wrote:
>
> On Mon, Nov 21, 2022 at 3:43 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Mon, Nov 21, 2022 at 4:20 PM John Naylor
> > <john(dot)naylor(at)enterprisedb(dot)com> wrote:
>
> > > Assuming the smallest node is fixed size (i.e. fanout/capacity member not part of the common set, so only part of variable-sized nodes), 3 has a nice property: no wasted padding space:
> > >
> > > node4: 5 + 4+(7) + 4*8 = 48 bytes
> > > node3: 5 + 3 + 3*8 = 32
> >
> > IIUC if we store the fanout member only in variable-sized nodes,
> > rt_node has only count, shift, and chunk, so 4 bytes in total. If so,
> > the size of node3 (ie. fixed-sized node) is (4 + 3 + (1) + 3*8)? The
> > size doesn't change but there is 1 byte padding space.
>
> I forgot to mention I'm assuming no pointer-tagging for this exercise. You've demonstrated it can be done in a small amount of code, and I hope we can demonstrate a speedup in search. Just in case there is some issue with portability, valgrind, or some other obstacle, I'm being pessimistic in my calculations.
>
> > Also, even if we have the node3 a variable-sized node, size class 1
> > for node3 could be a good choice since it also doesn't need padding
> > space and could be a good alternative to path compression.
> >
> > node3 : 5 + 3 + 3*8 = 32 bytes
> > size class 1 : 5 + 3 + 1*8 = 16 bytes
>
> Precisely! I have that scenario in my notes as well -- it's quite compelling.

So it seems that there are two candidates of rt_node structure: (1)
all nodes except for node256 are variable-size nodes and use pointer
tagging, and (2) node32 and node128 are variable-sized nodes and do
not use pointer tagging (fanout is in part of only these two nodes).
rt_node can be 5 bytes in both cases. But before going to this step, I
started to verify the idea of variable-size nodes by using 6-bytes
rt_node. We can adjust the node kinds and node classes later.

In this verification, I have all nodes except for node256
variable-sized nodes, and the sizes are:

radix tree node 1 : 6 + 4 + (6) + 1*8 = 24 bytes
radix tree node 4 : 6 + 4 + (6) + 4*8 = 48
radix tree node 15 : 6 + 32 + (2) + 15*8 = 160
radix tree node 32 : 6 + 32 + (2) + 32*8 = 296
radix tree node 61 : inner 6 + 256 + (2) + 61*8 = 752, leaf 6 +
256 + (2) + 16 + 61*8 = 768
radix tree node 128 : inner 6 + 256 + (2) + 128*8 = 1288, leaf 6 +
256 + (2) + 16 + 128*8 = 1304
radix tree node 256 : inner 6 + (2) + 256*8 = 2056, leaf 6 + (2) + 32
+ 256*8 = 2088

I did some performance tests against two radix trees: a radix tree
supporting only fixed-size nodes (i.e. applying up to 0003 patch), and
a radix tree supporting variable-size nodes (i.e. applying all
attached patches). Also, I changed bench_search_random_nodes()
function so that we can specify the filter via a function argument.
Here are results:

Here are results:

* Query
select * from bench_seq_search(0, 1*1000*1000, false)

* Fixed-size
NOTICE: num_keys = 1000000, height = 2, n4 = 0, n32 = 31251, n128 =
1, n256 = 122
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 9871216 | | 67 |
| 212 |
(1 row)

* Variable-size
NOTICE: num_keys = 1000000, height = 2, n1 = 0, n4 = 0, n15 = 0, n32
= 31251, n61 = 0, n128 = 1, n256 = 122
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 9871280 | | 74 |
| 212 |
(1 row)

---
* Query
select * from bench_seq_search(0, 2*1000*1000, true)
NOTICE: num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1,
n256 = 245
* Fixed-size
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 19680848 | | 74 |
| 201 |
(1 row)

* Variable-size
NOTICE: num_keys = 999654, height = 2, n1 = 0, n4 = 1, n15 = 26951,
n32 = 35548, n61 = 1, n128 = 0, n256 = 245
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 16009040 | | 85 |
| 201 |
(1 row)

---
* Query
select * from bench_search_random_nodes(10 * 1000 * 1000, '0x7F07FF00FF')

* Fixed-size
NOTICE: num_keys = 9291812, height = 4, n4 = 262144, n32 = 79603,
n128 = 182670, n256 = 1024
mem_allocated | search_ms
---------------+-----------
343001456 | 1151
(1 row)

* Variable-size
NOTICE: num_keys = 9291812, height = 4, n1 = 262144, n4 = 0, n15 =
138, n32 = 79465, n61 = 182665, n128 = 5, n256 = 1024
mem_allocated | search_ms
---------------+-----------
230504328 | 1077
(1 row)

---
* Query
select * from bench_search_random_nodes(10 * 1000 * 1000, '0xFFFF0000003F')
* Fixed-size
NOTICE: num_keys = 3807650, height = 5, n4 = 196608, n32 = 0, n128 =
65536, n256 = 257
mem_allocated | search_ms
---------------+-----------
99911920 | 632
(1 row)
* Variable-size
NOTICE: num_keys = 3807650, height = 5, n1 = 196608, n4 = 0, n15 = 0,
n32 = 0, n61 = 61747, n128 = 3789, n256 = 257
mem_allocated | search_ms
---------------+-----------
64045688 | 554
(1 row)

Overall, the idea of variable-sized nodes is good, smaller size
without losing search performance. I'm going to check the load
performance as well.

I've attached the patches I used for the verification. I don't include
patches for pointer tagging, DSA support, and vacuum integration since
I'm investigating the issue on cfbot that Andres reported. Also, I've
modified tests to improve the test coverage.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v11-0004-Preparatory-refactoring-for-decoupling-kind-from.patch application/x-patch 19.0 KB
v11-0003-tool-for-measuring-radix-tree-performance.patch application/x-patch 18.1 KB
v11-0001-introduce-vector8_min-and-vector8_highbit_mask.patch application/x-patch 2.6 KB
v11-0002-Add-radix-implementation.patch application/x-patch 85.8 KB
v11-0005-Make-all-node-kinds-variable-sized.patch application/x-patch 24.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Kellerer 2022-11-24 15:00:59 Re: Patch: Global Unique Index
Previous Message Daniel Gustafsson 2022-11-24 14:24:21 Re: TAP output format in pg_regress