Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2022-10-31 05:46:53
Message-ID: CAD21AoAT0p0crCjiY+UwXbkQG2bRNFbe0UF0Ba5EfC_w458JBQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Oct 27, 2022 at 12:21 PM John Naylor
<john(dot)naylor(at)enterprisedb(dot)com> wrote:
>
> On Thu, Oct 27, 2022 at 9:11 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > True. I'm going to start with 6 bytes and will consider reducing it to
> > 5 bytes.
>
> Okay, let's plan on 6 for now, so we have the worst-case sizes up front. As discussed, I will attempt the size class decoupling after v8 and see how it goes.
>
> > Encoding the kind in a pointer tag could be tricky given DSA
>
> If it turns out to be unworkable, that's life. If it's just tricky, that can certainly be put off for future work. I hope to at least test it out with local memory.
>
> > support so currently I'm thinking to pack the node kind and node
> > capacity classes to uint8.
>
> That won't work, if we need 128 for capacity, leaving no bits left. I want the capacity to be a number we can directly compare with the count (we won't ever need to store 256 because that node will never grow). Also, further to my last message, we need to access the kind quickly, without more cycles.

Understood.

>
> > I've made some progress on investigating DSA support. I've written
> > draft patch for that and regression tests passed. I'll share it as a
> > separate patch for discussion with v8 radix tree patch.
>
> Great!
>
> > While implementing DSA support, I realized that we may not need to use
> > pointer tagging to distinguish between backend-local address or
> > dsa_pointer. In order to get a backend-local address from dsa_pointer,
> > we need to pass dsa_area like:
>
> I was not clear -- when I see how much code changes to accommodate DSA pointers, I imagine I will pretty much know the places that would be affected by tagging the pointer with the node kind.
>
> Speaking of tests, there is currently no Meson support, but tests pass because this library is not used anywhere in the backend yet, and apparently the CI Meson builds don't know to run the regression test? That will need to be done too. However, it's okay to keep the benchmarking module in autoconf, since it won't be committed.

Updated to support Meson.
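
For reference, here is a minimal sketch of what the test module's
meson.build could look like, assuming the module lives in
src/test/modules/test_radixtree and follows the conventions of the
other modules there (the exact file names and contents in v8 may
differ):

# sources of the test module (module name assumed)
test_radixtree_sources = files(
  'test_radixtree.c',
)

test_radixtree = shared_module('test_radixtree',
  test_radixtree_sources,
  kwargs: pg_test_mod_args,
)
test_install_libs += test_radixtree

# register the regression test so the CI Meson builds run it
tests += {
  'name': 'test_radixtree',
  'sd': meson.current_source_dir(),
  'bd': meson.current_build_dir(),
  'regress': {
    'sql': ['test_radixtree'],
  },
}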

>
> > > +static inline void
> > > +chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
> > > + uint8 *dst_chunks, rt_node **dst_children, int count)
> > > +{
> > > + memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
> > > + memcpy(dst_children, src_children, sizeof(rt_node *) * count);
> > > +}
> > >
> > > gcc generates better code with something like this (but not hard-coded) at the top:
> > >
> > > if (count > 4)
> > > pg_unreachable();
>
> Actually it just now occurred to me there's a bigger issue here: *We* know this code can only get here iff count==4, so why doesn't the compiler know that? I believe it boils down to
>
> static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
>
> In the assembly, I see it checks if there is room in the node by doing a runtime lookup in this array, which is not constant. This might not be important just yet, because I want to base the check on the proposed node capacity instead, but I mention it as a reminder to us to make sure we take all opportunities for the compiler to propagate constants.

I've attached the v8 patches. The 0001, 0002, and 0003 patches
incorporate the comments I've received so far, and the 0004 patch adds
DSA support as a PoC.

In the 0004 patch, the basic idea is to use rt_node_ptr in all inner
nodes to point to their children, and to interpret an rt_node_ptr as
either an rt_node * or a dsa_pointer depending on whether the radix
tree is shared or not (i.e., by checking radix_tree->dsa == NULL).
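
For illustration, here is a minimal sketch of that idea (the union and
the helper function are illustrative; the exact names in the 0004
patch may differ):

/* Sketch only: a pointer whose meaning depends on the tree's mode. */
typedef union rt_node_ptr
{
    dsa_pointer encoded;        /* used when the tree is shared */
    rt_node    *local;          /* used when the tree is backend-local */
} rt_node_ptr;

/* Resolve an rt_node_ptr to a backend-local address. */
static inline rt_node *
node_ptr_get_local(radix_tree *tree, rt_node_ptr ptr)
{
    if (tree->dsa == NULL)
        return ptr.local;       /* local tree: use the pointer as-is */
    return (rt_node *) dsa_get_address(tree->dsa, ptr.encoded);
}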
Regarding performance, I've added another boolean argument to
bench_seq_search()/bench_shuffle_search(), specifying whether to use
the shared radix tree. Here are benchmark results in my environment:

select * from bench_seq_search(0, 1 * 1000 * 1000, false, false);
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_search_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |          9871240 |           180000000 |         67 |               |          241 |
(1 row)

select * from bench_seq_search(0, 1 * 1000 * 1000, false, true);
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_search_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |         14680064 |           180000000 |         81 |               |          483 |
(1 row)

select * from bench_seq_search(0, 2 * 1000 * 1000, true, false);
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_search_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
  999654 |         19680872 |           179937720 |         74 |               |          235 |
(1 row)

select * from bench_seq_search(0, 2 * 1000 * 1000, true, true);
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_search_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
  999654 |         23068672 |           179937720 |         86 |               |          445 |
(1 row)

select * from bench_shuffle_search(0, 1 * 1000 * 1000, false, false);
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_search_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |          9871240 |           180000000 |         67 |               |          640 |
(1 row)

select * from bench_shuffle_search(0, 1 * 1000 * 1000, false, true);
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_search_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |         14680064 |           180000000 |         81 |               |         1002 |
(1 row)

select * from bench_shuffle_search(0, 2 * 1000 * 1000, true, false);
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_search_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
  999654 |         19680872 |           179937720 |         74 |               |          697 |
(1 row)

select * from bench_shuffle_search(0, 2 * 1000 * 1000, true, true);
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_search_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
  999654 |         23068672 |           179937720 |         86 |               |         1030 |
(1 row)

In the non-shared radix tree cases (where the fourth argument is
false), I don't see a visible performance degradation. On the other
hand, in the shared radix tree cases (where the fourth argument is
true), I see visible overhead, since resolving each rt_node_ptr
requires a dsa_get_address() call.

Please note that the current shared radix tree implementation doesn't
support any locking, so it cannot be read while being written by
another process. Also, only one process at a time can iterate over the
shared radix tree. For parallel vacuum, these are not real
restrictions: the leader process writes the radix tree while scanning
the heap, the radix tree is read by multiple processes while vacuuming
indexes, and only the leader process does the heap vacuum, by
iterating over the key-value pairs in the radix tree. If we want to
use it in other cases too, we would need to support locking, RCU, or
something similar. A rough sketch of the intended access pattern
follows.
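
This sketch shows why the unlocked shared tree suffices for parallel
vacuum (the rt_set/rt_search/rt_begin_iterate/rt_iterate_next/
rt_end_iterate names are assumptions based on the patch's API style,
not a definitive interface):

/*
 * The three phases of parallel vacuum never overlap, so no locking
 * is needed on the shared radix tree.
 */
static void
parallel_vacuum_access_pattern(radix_tree *rt)
{
    uint64      key = 0;
    uint64      val = 0;
    rt_iter    *iter;

    /* Phase 1: only the leader writes, while scanning the heap */
    rt_set(rt, key, val);

    /* Phase 2: workers only read, while vacuuming indexes */
    if (rt_search(rt, key, &val))
        ;                       /* dead TID: delete the index entry */

    /* Phase 3: only the leader iterates, to do the heap vacuum */
    iter = rt_begin_iterate(rt);
    while (rt_iterate_next(iter, &key, &val))
        ;                       /* process one key-value pair */
    rt_end_iterate(iter);
}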

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v8-0001-introduce-vector8_min-and-vector8_highbit_mask.patch application/octet-stream 2.6 KB
v8-0004-PoC-DSA-support-for-radix-tree.patch application/octet-stream 56.1 KB
v8-0002-Add-radix-implementation.patch application/octet-stream 82.8 KB
v8-0003-tool-for-measuring-radix-tree-performance.patch application/octet-stream 15.8 KB
