From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
---|---|
To: | John Naylor <john(dot)naylor(at)enterprisedb(dot)com> |
Cc: | Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [PoC] Improve dead tuple storage for lazy vacuum |
Date: | 2022-11-24 14:54:06 |
Message-ID: | CAD21AoCiiKz_sxNFAvKY_--7-O6ioOwr6Rf0aL71cQvccipiPQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Mon, Nov 21, 2022 at 6:30 PM John Naylor
<john(dot)naylor(at)enterprisedb(dot)com> wrote:
>
> On Mon, Nov 21, 2022 at 3:43 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Mon, Nov 21, 2022 at 4:20 PM John Naylor
> > <john(dot)naylor(at)enterprisedb(dot)com> wrote:
>
> > > Assuming the smallest node is fixed size (i.e. fanout/capacity member not part of the common set, so only part of variable-sized nodes), 3 has a nice property: no wasted padding space:
> > >
> > > node4: 5 + 4+(7) + 4*8 = 48 bytes
> > > node3: 5 + 3 + 3*8 = 32
> >
> > IIUC if we store the fanout member only in variable-sized nodes,
> > rt_node has only count, shift, and chunk, so 4 bytes in total. If so,
> > the size of node3 (ie. fixed-sized node) is (4 + 3 + (1) + 3*8)? The
> > size doesn't change but there is 1 byte padding space.
>
> I forgot to mention I'm assuming no pointer-tagging for this exercise. You've demonstrated it can be done in a small amount of code, and I hope we can demonstrate a speedup in search. Just in case there is some issue with portability, valgrind, or some other obstacle, I'm being pessimistic in my calculations.
>
> > Also, even if we have the node3 a variable-sized node, size class 1
> > for node3 could be a good choice since it also doesn't need padding
> > space and could be a good alternative to path compression.
> >
> > node3 : 5 + 3 + 3*8 = 32 bytes
> > size class 1 : 5 + 3 + 1*8 = 16 bytes
>
> Precisely! I have that scenario in my notes as well -- it's quite compelling.
So it seems that there are two candidates of rt_node structure: (1)
all nodes except for node256 are variable-size nodes and use pointer
tagging, and (2) node32 and node128 are variable-sized nodes and do
not use pointer tagging (fanout is in part of only these two nodes).
rt_node can be 5 bytes in both cases. But before going to this step, I
started to verify the idea of variable-size nodes by using 6-bytes
rt_node. We can adjust the node kinds and node classes later.
In this verification, I have all nodes except for node256
variable-sized nodes, and the sizes are:
radix tree node 1 : 6 + 4 + (6) + 1*8 = 24 bytes
radix tree node 4 : 6 + 4 + (6) + 4*8 = 48
radix tree node 15 : 6 + 32 + (2) + 15*8 = 160
radix tree node 32 : 6 + 32 + (2) + 32*8 = 296
radix tree node 61 : inner 6 + 256 + (2) + 61*8 = 752, leaf 6 +
256 + (2) + 16 + 61*8 = 768
radix tree node 128 : inner 6 + 256 + (2) + 128*8 = 1288, leaf 6 +
256 + (2) + 16 + 128*8 = 1304
radix tree node 256 : inner 6 + (2) + 256*8 = 2056, leaf 6 + (2) + 32
+ 256*8 = 2088
I did some performance tests against two radix trees: a radix tree
supporting only fixed-size nodes (i.e. applying up to 0003 patch), and
a radix tree supporting variable-size nodes (i.e. applying all
attached patches). Also, I changed bench_search_random_nodes()
function so that we can specify the filter via a function argument.
Here are results:
Here are results:
* Query
select * from bench_seq_search(0, 1*1000*1000, false)
* Fixed-size
NOTICE: num_keys = 1000000, height = 2, n4 = 0, n32 = 31251, n128 =
1, n256 = 122
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 9871216 | | 67 |
| 212 |
(1 row)
* Variable-size
NOTICE: num_keys = 1000000, height = 2, n1 = 0, n4 = 0, n15 = 0, n32
= 31251, n61 = 0, n128 = 1, n256 = 122
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 9871280 | | 74 |
| 212 |
(1 row)
---
* Query
select * from bench_seq_search(0, 2*1000*1000, true)
NOTICE: num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1,
n256 = 245
* Fixed-size
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 19680848 | | 74 |
| 201 |
(1 row)
* Variable-size
NOTICE: num_keys = 999654, height = 2, n1 = 0, n4 = 1, n15 = 26951,
n32 = 35548, n61 = 1, n128 = 0, n256 = 245
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 16009040 | | 85 |
| 201 |
(1 row)
---
* Query
select * from bench_search_random_nodes(10 * 1000 * 1000, '0x7F07FF00FF')
* Fixed-size
NOTICE: num_keys = 9291812, height = 4, n4 = 262144, n32 = 79603,
n128 = 182670, n256 = 1024
mem_allocated | search_ms
---------------+-----------
343001456 | 1151
(1 row)
* Variable-size
NOTICE: num_keys = 9291812, height = 4, n1 = 262144, n4 = 0, n15 =
138, n32 = 79465, n61 = 182665, n128 = 5, n256 = 1024
mem_allocated | search_ms
---------------+-----------
230504328 | 1077
(1 row)
---
* Query
select * from bench_search_random_nodes(10 * 1000 * 1000, '0xFFFF0000003F')
* Fixed-size
NOTICE: num_keys = 3807650, height = 5, n4 = 196608, n32 = 0, n128 =
65536, n256 = 257
mem_allocated | search_ms
---------------+-----------
99911920 | 632
(1 row)
* Variable-size
NOTICE: num_keys = 3807650, height = 5, n1 = 196608, n4 = 0, n15 = 0,
n32 = 0, n61 = 61747, n128 = 3789, n256 = 257
mem_allocated | search_ms
---------------+-----------
64045688 | 554
(1 row)
Overall, the idea of variable-sized nodes is good, smaller size
without losing search performance. I'm going to check the load
performance as well.
I've attached the patches I used for the verification. I don't include
patches for pointer tagging, DSA support, and vacuum integration since
I'm investigating the issue on cfbot that Andres reported. Also, I've
modified tests to improve the test coverage.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment | Content-Type | Size |
---|---|---|
v11-0004-Preparatory-refactoring-for-decoupling-kind-from.patch | application/x-patch | 19.0 KB |
v11-0003-tool-for-measuring-radix-tree-performance.patch | application/x-patch | 18.1 KB |
v11-0001-introduce-vector8_min-and-vector8_highbit_mask.patch | application/x-patch | 2.6 KB |
v11-0002-Add-radix-implementation.patch | application/x-patch | 85.8 KB |
v11-0005-Make-all-node-kinds-variable-sized.patch | application/x-patch | 24.4 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Thomas Kellerer | 2022-11-24 15:00:59 | Re: Patch: Global Unique Index |
Previous Message | Daniel Gustafsson | 2022-11-24 14:24:21 | Re: TAP output format in pg_regress |