Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <johncnaylorls(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2024-02-15 03:20:46
Message-ID: CAD21AoAiwz2Ehu9wyNrRQw1ZjWZp1RHdcqG16f0o40iW36kDWA@mail.gmail.com
Lists: pgsql-hackers

On Sat, Feb 10, 2024 at 9:29 PM John Naylor <johncnaylorls(at)gmail(dot)com> wrote:
>
> On Tue, Feb 6, 2024 at 9:58 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Fri, Feb 2, 2024 at 8:47 PM John Naylor <johncnaylorls(at)gmail(dot)com> wrote:
>
> > > My todo:
> > > - benchmark tid store / vacuum again, since we haven't since adding
> > > varlen types and removing unnecessary locks.
>
> I ran a vacuum benchmark similar to the one in [1] (unlogged tables
> for reproducibility), but smaller tables (100 million records),
> deleting only the last 20% of the table, and including a parallel
> vacuum test. Scripts attached.
>
> monotonically ordered int column index:
>
> master:
> system usage: CPU: user: 4.27 s, system: 0.41 s, elapsed: 4.70 s
> system usage: CPU: user: 4.23 s, system: 0.44 s, elapsed: 4.69 s
> system usage: CPU: user: 4.26 s, system: 0.39 s, elapsed: 4.66 s
>
> v-59:
> system usage: CPU: user: 3.10 s, system: 0.44 s, elapsed: 3.56 s
> system usage: CPU: user: 3.07 s, system: 0.35 s, elapsed: 3.43 s
> system usage: CPU: user: 3.07 s, system: 0.36 s, elapsed: 3.44 s
>
> uuid column index:
>
> master:
> system usage: CPU: user: 18.22 s, system: 1.70 s, elapsed: 20.01 s
> system usage: CPU: user: 17.70 s, system: 1.70 s, elapsed: 19.48 s
> system usage: CPU: user: 18.48 s, system: 1.59 s, elapsed: 20.43 s
>
> v-59:
> system usage: CPU: user: 5.18 s, system: 1.18 s, elapsed: 6.45 s
> system usage: CPU: user: 6.56 s, system: 1.39 s, elapsed: 7.99 s
> system usage: CPU: user: 6.51 s, system: 1.44 s, elapsed: 8.05 s
>
> int & uuid indexes in parallel:
>
> master:
> system usage: CPU: user: 4.53 s, system: 1.22 s, elapsed: 20.43 s
> system usage: CPU: user: 4.49 s, system: 1.29 s, elapsed: 20.98 s
> system usage: CPU: user: 4.46 s, system: 1.33 s, elapsed: 20.50 s
>
> v-59:
> system usage: CPU: user: 2.09 s, system: 0.32 s, elapsed: 4.86 s
> system usage: CPU: user: 3.76 s, system: 0.51 s, elapsed: 8.92 s
> system usage: CPU: user: 3.83 s, system: 0.54 s, elapsed: 9.09 s
>
> Overall, I'm pleased with these results, although I'm confused why,
> with the patch, the first run sometimes reports running faster than
> the others. I'm curious what others get. Traversing a tree that lives
> in DSA has some overhead, as expected, but it still comes out well
> ahead of master.

Thanks! That's a great improvement.

I've also run the same scripts in my environment, just in case, and got
similar results:

monotonically ordered int column index:

master:
system usage: CPU: user: 14.81 s, system: 0.90 s, elapsed: 15.74 s
system usage: CPU: user: 14.91 s, system: 0.80 s, elapsed: 15.73 s
system usage: CPU: user: 14.85 s, system: 0.70 s, elapsed: 15.57 s

v-59:
system usage: CPU: user: 9.47 s, system: 1.04 s, elapsed: 10.53 s
system usage: CPU: user: 9.67 s, system: 0.81 s, elapsed: 10.50 s
system usage: CPU: user: 9.59 s, system: 0.86 s, elapsed: 10.47 s

uuid column index:

master:
system usage: CPU: user: 28.37 s, system: 1.38 s, elapsed: 29.81 s
system usage: CPU: user: 28.05 s, system: 1.37 s, elapsed: 29.47 s
system usage: CPU: user: 28.46 s, system: 1.36 s, elapsed: 29.88 s

v-59:
system usage: CPU: user: 14.87 s, system: 1.13 s, elapsed: 16.02 s
system usage: CPU: user: 14.84 s, system: 1.31 s, elapsed: 16.18 s
system usage: CPU: user: 10.96 s, system: 1.24 s, elapsed: 12.22 s

int & uuid indexes in parallel:

master:
system usage: CPU: user: 15.81 s, system: 1.43 s, elapsed: 34.31 s
system usage: CPU: user: 15.84 s, system: 1.41 s, elapsed: 34.34 s
system usage: CPU: user: 15.92 s, system: 1.39 s, elapsed: 34.33 s

v-59:
system usage: CPU: user: 10.93 s, system: 0.92 s, elapsed: 17.59 s
system usage: CPU: user: 10.92 s, system: 1.20 s, elapsed: 17.58 s
system usage: CPU: user: 10.90 s, system: 1.01 s, elapsed: 17.45 s

>
> There are still some micro-benchmarks we could do on tidstore, and
> it'd be good to find out worst-case memory use (1 dead tuple each on
> spread-out pages), but this is a decent demonstration.

I've tested a simple case where vacuum removes 33k dead tuples, spread
out at roughly one dead tuple every 10 pages.

master:
198,000 bytes (= 33000 TIDs * 6 bytes each)
system usage: CPU: user: 29.49 s, system: 0.88 s, elapsed: 30.40 s

v-59:
2,834,432 bytes (reported by TidStoreMemoryUsage())
system usage: CPU: user: 15.96 s, system: 0.89 s, elapsed: 16.88 s
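
For context on the master number: each dead TID there occupies one
6-byte ItemPointerData in the dead_items array, which is where the
33000 * 6 arithmetic comes from. Abridged from PostgreSQL's storage
headers (the comments are mine):

typedef struct BlockIdData
{
    uint16      bi_hi;          /* high half of the block number */
    uint16      bi_lo;          /* low half of the block number */
} BlockIdData;                  /* 4 bytes */

typedef uint16 OffsetNumber;    /* 2 bytes: line pointer within a block */

typedef struct ItemPointerData
{
    BlockIdData ip_blkid;       /* which heap block */
    OffsetNumber ip_posid;      /* which line pointer in that block */
} ItemPointerData;              /* 6 bytes per dead TID */

The TidStore's 2.8MB for the same TIDs reflects the radix tree's
per-node allocations, which this sparse distribution exercises hardest,
but the elapsed time still comes out nearly 2x better.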

>
> > > I'm not sure what the test_node_types_* functions are testing that
> > > test_basic doesn't. They have a different, and confusing, way to stop
> > > at every size class and check the keys/values. It seems we can replace
> > > all that with two more calls (asc/desc) to test_basic, with the
> > > maximum level.
>
> v58-0008:
>
> + /* borrowed from RT_MAX_SHIFT */
> + const int max_shift = (pg_leftmost_one_pos64(UINT64_MAX) /
> BITS_PER_BYTE) * BITS_PER_BYTE;
>
> This is harder to read than "64 - 8", and doesn't really help
> maintainability either.
> Maybe "(sizeof(uint64) - 1) * BITS_PER_BYTE" is a good compromise.
>
> + /* leaf nodes */
> + test_basic(test_info, 0);
>
> + /* internal nodes */
> + test_basic(test_info, 8);
> +
> + /* max-level nodes */
> + test_basic(test_info, max_shift);
>
> This three-way terminology is not very informative. How about:
>
> + /* a tree with one level, i.e. a single node under the root node. */
> ...
> + /* a tree with two levels */
> ...
> + /* a tree with the maximum number of levels */

Agreed.
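
Assembled from your suggestions, the calls would then read something
like this (a sketch, not the committed code):

/* easier to read than deriving this from pg_leftmost_one_pos64() */
const int max_shift = (sizeof(uint64) - 1) * BITS_PER_BYTE;

/* a tree with one level, i.e. a single node under the root node */
test_basic(test_info, 0);

/* a tree with two levels */
test_basic(test_info, 8);

/* a tree with the maximum number of levels */
test_basic(test_info, max_shift);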

>
> +static void
> +test_basic(rt_node_class_test_elem *test_info, int shift)
> +{
> + elog(NOTICE, "testing node %s with shift %d", test_info->class_name, shift);
> +
> + /* Test nodes while changing the key insertion order */
> + do_test_basic(test_info->nkeys, shift, false);
> + do_test_basic(test_info->nkeys, shift, true);
>
> Adding a level of indirection makes this harder to read, and do we
> still know whether a test failed in asc or desc keys?

Agreed; it seems better to keep the previous logging style.
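
Something like the following, so a failure report still says which
insertion order was being tested (a sketch, assuming the third argument
of do_test_basic() selects descending order):

/* Test nodes while changing the key insertion order */
elog(NOTICE, "testing node %s with shift %d, ascending keys",
     test_info->class_name, shift);
do_test_basic(test_info->nkeys, shift, false);

elog(NOTICE, "testing node %s with shift %d, descending keys",
     test_info->class_name, shift);
do_test_basic(test_info->nkeys, shift, true);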

>
> > > My earlier opinion was that "handle" was a nicer variable name, but
> > > this brings back the typedef and also keeps the variable name I
> > > didn't like, just pushed down into the function. I'm a bit confused,
> > > so I've kept these not-squashed for now.
> >
> > I misunderstood your comment. I've changed to use a variable name
> > rt_handle and removed the TidStoreHandle type. 0013 patch.
>
> (diff against an earlier version)
> - pvs->shared->dead_items_handle = TidStoreGetHandle(dead_items);
> + pvs->shared->dead_items_dp = TidStoreGetHandle(dead_items);
>
> Shall we use "handle" in vacuum_parallel.c as well?

Agreed.
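
For anyone skimming the thread: this value is the dsa_pointer a
parallel worker uses to attach to the leader's shared dead-TID store,
so "handle" describes its role on both sides. Roughly (glossing over
the exact v59 signatures, which may differ):

/* leader: publish the handle in the parallel-vacuum shared area */
pvs->shared->dead_items_handle = TidStoreGetHandle(dead_items);

/* worker: attach to the leader's TidStore through that handle
 * ("dsa" here stands for the worker's attached dsa_area) */
dead_items = TidStoreAttach(dsa, pvs->shared->dead_items_handle);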

>
> > > I'm pretty sure there's an
> > > accidental memset call that crept in there, but I'm running out of
> > > steam today.
>
> I have just a little bit of work to add for v59:
>
> v59-0009 - set_offset_bitmap_at() will call memset if it needs to zero
> any bitmapwords. That can only happen if, for example, there is an
> offset > 128 and none between 64 and 128, so it's not a huge deal, but
> I think it's a bit nicer in this patch.

LGTM.
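
To spell out the scenario for the archive: the memset only has work to
do when a newly set offset skips over whole, so-far-untouched
bitmapwords. A simplified standalone sketch of the idea (not the
patch's actual code):

#include <stdint.h>
#include <string.h>

typedef uint64_t bitmapword;
#define BITS_PER_BITMAPWORD 64

/*
 * Set the bit for offset 'off', where '*nwords' bitmapwords are in use
 * (the array is assumed large enough).  The memset fires only when the
 * new offset lands beyond the last used word with untouched words in
 * between -- e.g. off > 128 while only offsets below 64 are set.
 */
static void
set_offset_bitmap_at(bitmapword *words, int *nwords, int off)
{
    int         wordnum = off / BITS_PER_BITMAPWORD;

    if (wordnum >= *nwords)
    {
        /* zero any whole words we skipped over */
        if (wordnum > *nwords)
            memset(&words[*nwords], 0,
                   (wordnum - *nwords) * sizeof(bitmapword));
        words[wordnum] = 0;
        *nwords = wordnum + 1;
    }
    words[wordnum] |= ((bitmapword) 1) << (off % BITS_PER_BITMAPWORD);
}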

>
> > > > * WIP: notes about traditional radix tree trading off span vs height...
> > > >
> > > > Are you going to write it?
> > >
> > > Yes, when I draft a rough commit message, (for next time).
>
> I haven't gotten to the commit message, but:

I've drafted the commit message.

>
> v59-0004 - I did some rewriting of the top header comment to explain
> ART concepts for new readers, made small comment changes, and tidied
> up some indentation that pgindent won't touch
> v59-0005 - re-pgindent'ed

LGTM, squashed all changes.

I've attached these updates on top of v59 as separate patches.

I've run the regression tests under Valgrind and run a Coverity scan,
and I don't see any critical issues.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v60-ART.tar.gz application/gzip 55.9 KB
